This is an archive of the discontinued LLVM Phabricator instance.


Differential D93594

[X86] Pass to transform amx intrinsics to scalar operation.
ClosedPublic

Authored by yubing on Dec 20 2020, 4:58 AM.


Details

Reviewers

LuoYuanke
pengfei
xiangzhangllvm
craig.topper

Commits

rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation.
rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation.

Summary

This pass always runs, but it does nothing unless the pipeline is at O0 or
the function has the optnone attribute. At -O0, the definition of the shape
passed to an AMX intrinsic sits close to the intrinsic itself, so we are not
able to find a point that post-dominates all the shape definitions and
dominates all the AMX intrinsics. To decouple the dependency on the shape, we
transform AMX intrinsics into scalar operations so that compilation doesn't
fail. In the long term, we should improve fast register allocation to
allocate AMX registers.
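
The skip condition described in the summary can be sketched as a small predicate. This is only an illustration: `OptLevel`, `FunctionInfo`, and `shouldScalarizeAMX` are hypothetical stand-ins and not the types or functions used by the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; these are not LLVM types.
enum class OptLevel { O0, O1, O2, O3 };

struct FunctionInfo {
  bool HasOptNone; // models the `optnone` function attribute
};

// Returns true when the AMX intrinsics in a function should be scalarized:
// either the pipeline runs at O0, or this particular function is optnone.
bool shouldScalarizeAMX(OptLevel Level, const FunctionInfo &F) {
  return Level == OptLevel::O0 || F.HasOptNone;
}

// Usage: an optnone function is still scalarized inside an O3 pipeline.
void exampleSkipPolicy() {
  assert(shouldScalarizeAMX(OptLevel::O3, FunctionInfo{true}));
  assert(!shouldScalarizeAMX(OptLevel::O2, FunctionInfo{false}));
}
```

Checking the attribute per function, rather than only the pipeline level, covers the mixed O0/O3 LTO case raised in the review.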

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	410 ms	x64 debian > Clang.CodeGen/X86::amx_api.c
	50 ms	x64 debian > LLVM.CodeGen/X86/AMX::amx-low-intrinsics.ll
	910 ms	x64 windows > Clang.CodeGen/X86::amx_api.c

Event Timeline

LuoYuanke created this revision.Dec 20 2020, 4:58 AM

Herald added subscribers: nikic, pengfei, hiraditya, mgorny.Dec 20 2020, 4:58 AM

LuoYuanke requested review of this revision.Dec 20 2020, 4:58 AM

Herald added a project: Restricted Project.Dec 20 2020, 4:58 AM

Herald added a subscriber: llvm-commits.

LuoYuanke added a parent revision: D91927: [X86] Add x86_amx type for intel AMX..Dec 20 2020, 5:00 AM

LuoYuanke added subscribers: annita.zhang, LiuChen3.

Harbormaster completed remote builds in B83067: Diff 312970.Dec 20 2020, 5:49 AM

craig.topper added a subscriber: craig.topper.Dec 20 2020, 11:54 AM

craig.topper added inline comments.

llvm/lib/Target/X86/X86TargetMachine.cpp
415	I don't think you can detect O0 this way. A function can have the optnone attribute in a non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline.

LuoYuanke added inline comments.Dec 21 2020, 4:05 AM

llvm/lib/Target/X86/X86TargetMachine.cpp
415	@craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass to determine whether the AMX intrinsics in the function need to be scalarized?

Address Craig's comments and fix clang format issue.

Add test for functions without the optnone attribute.

Harbormaster completed remote builds in B83139: Diff 313085.Dec 21 2020, 5:51 AM

Harbormaster completed remote builds in B83138: Diff 313084.Dec 21 2020, 6:14 AM

Scalarize tilestore.

Harbormaster completed remote builds in B83364: Diff 313492.Dec 23 2020, 12:45 AM

Support tile_zero and fix bugs for tile_load and tile_store.

LuoYuanke added a subscriber: yubing.Jan 12 2021, 10:01 PM

Harbormaster completed remote builds in B84973: Diff 316320.Jan 12 2021, 10:31 PM

yubing commandeered this revision.Jan 28 2021, 12:10 AM

yubing added a reviewer: LuoYuanke.

Fix some bugs in lowerTileDPBSSD, lowerTileStore, lowerTileLoad

Herald added a project: Restricted Project.Jan 28 2021, 2:15 AM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B86986: Diff 319797.Jan 28 2021, 3:40 AM

yubing added a reviewer: pengfei.Feb 4 2021, 11:57 PM

LuoYuanke added a reviewer: xiangzhangllvm.Feb 5 2021, 12:00 AM

LuoYuanke added a reviewer: craig.topper.Feb 5 2021, 12:02 AM

Would you rebase to see if the lit test failure is related to this patch?

Rebase and fix the bug in amx_api.c

yubing mentioned this in D96110: [X86] Pass to transform tdpbf16ps intrinsics to scalar operation..Feb 5 2021, 1:32 AM

Harbormaster completed remote builds in B88039: Diff 321673.Feb 5 2021, 1:42 AM

Strange. llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll passes on my local machine.

yubing added inline comments.Feb 8 2021, 7:53 PM

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
78	Sorry, there is a bug here. According to AMX's spec, dst's remaining part should be all zero.
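
The zero-fill requirement mentioned here can be illustrated with a scalar model of a tile load. This is a minimal sketch under assumptions: the tile is modeled as 16 rows of 16 dwords (the <256 x i32> shape discussed in this review), and `scalarTileLoad` and its parameters are hypothetical names, not the loops the pass generates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative model only: a tile is held as 16 rows x 16 dwords, matching the
// <256 x i32> shape used in this review. Names here are assumptions.
constexpr std::size_t kTileRows = 16;
constexpr std::size_t kTileColDwords = 16;
using Tile = std::array<std::int32_t, kTileRows * kTileColDwords>;

// Loads Rows x (ColBytes / 4) dwords from memory; every element outside that
// region stays zero.
Tile scalarTileLoad(std::size_t Rows, std::size_t ColBytes, const void *Base,
                    std::size_t StrideBytes) {
  assert(Rows <= kTileRows && ColBytes % 4 == 0 &&
         ColBytes / 4 <= kTileColDwords);
  Tile T{}; // value-initialized, so the whole tile starts as zero
  const unsigned char *Bytes = static_cast<const unsigned char *>(Base);
  for (std::size_t R = 0; R < Rows; ++R)
    for (std::size_t C = 0; C < ColBytes / 4; ++C)
      std::memcpy(&T[R * kTileColDwords + C], Bytes + R * StrideBytes + C * 4,
                  sizeof(std::int32_t));
  return T;
}
```

Starting from an all-zero tile and writing only the configured region is one simple way to guarantee the remaining part is zero.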

xiangzhangllvm added inline comments.Feb 9 2021, 1:12 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
357	I see you need to force-match the bitcast and then replace it; add an assert for the no-bitcast case.
472	'|' is bitwise or; use logical '||'.

pengfei added inline comments.Feb 9 2021, 1:45 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
434	`bool C = false`
440	We can iterate it in forward order. Besides, we cannot assume there will always be a bitcast after e.g. x86_tileloadd64_internal, so we need to insert one bitcast as required.
471	Remove the `{}` for single line loop.
507	You can just return it by `return LAT.visit()`.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
61	Maybe we can use zero mask load in future optimization.

LuoYuanke added inline comments.Feb 9 2021, 4:24 AM

llvm/lib/Target/X86/X86TargetMachine.cpp
416	We may add both passes anyway and skip each pass based on the optimization level and the optnone attribute inside the pass.

xiangzhangllvm added inline comments.Feb 9 2021, 4:45 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213	In fact, there is no need to handle Row, Col, K here; just use a fixed size of 16x16. The result of the calculation is the same in the effective area (tileload just needs to keep the "unused" area zero). Then we can use vectors to handle all of them and let type legalization split the type.

yubing added inline comments.Feb 19 2021, 9:46 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213	We should keep the code here. In bf16, since +0.0 (0x0000) times a negative float equals -0.0 (0x8000), your solution cannot ensure the outer edge is all zero.
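
The signed-zero behavior behind this objection can be checked directly. The sketch below assumes IEEE-754 floats without fast-math, and models bf16 as the upper 16 bits of a 32-bit float; `truncateToBF16Bits` and `paddedProductBits` are hypothetical helper names. It only demonstrates the product of a zero padding element and a negative operand, not the full tdpbf16ps accumulation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Upper 16 bits of an IEEE-754 binary32 value, i.e. the float truncated to bf16.
std::uint16_t truncateToBF16Bits(float F) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return static_cast<std::uint16_t>(Bits >> 16);
}

// Product of a zero padding element (+0.0) and an arbitrary operand.
std::uint16_t paddedProductBits(float Operand) {
  volatile float Zero = 0.0f; // volatile so the multiply happens at run time
  return truncateToBF16Bits(Zero * Operand);
}

// Usage: the sign of the operand leaks into the "zero" result.
void exampleSignedZero() {
  assert(paddedProductBits(-3.5f) == 0x8000); // -0.0: sign bit set
  assert(paddedProductBits(3.5f) == 0x0000);  // +0.0
}
```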

Address the comments above.

Small fix for some code

yubing marked an inline comment as done.Feb 19 2021, 10:01 PM

Harbormaster completed remote builds in B90020: Diff 325152.Feb 19 2021, 10:49 PM

Harbormaster completed remote builds in B90023: Diff 325155.Feb 19 2021, 11:20 PM

LuoYuanke added inline comments.Feb 20 2021, 12:57 AM

llvm/include/llvm/CodeGen/Passes.h
496	Add comments to describe what the pass does?
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2	This seems wrong file name.
12	Typo: 'able'.
160	Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops, so that it can be used by both and some other functions.
251	Delete the dead code.
268	It should be in another line.
281	Better to be in a new line.
293	Better to be in a new line.
373	The name doesn't seem good. Is "PreBuilder" better? And why do we need two builders in the function?
378	Maybe use a right-shift instruction, which is more efficient. I'm not sure whether the following passes can optimize the operation.
412	Is "PreBuilder" better?
416	Shift?
449	PreBuilder?
488	Do we iterate the instructions in topological order or in post order?

pengfei added inline comments.Feb 21 2021, 7:22 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
491	It would be better to use if (auto *Inst = dyn_cast<IntrinsicInst>(&*II++)) switch (Inst->getIntrinsicID()) { case Intrinsic::x86_tdpbssd_internal: ...
502	ditto

Address comments above

yubing marked 13 inline comments as done.Feb 22 2021, 9:54 PM

yubing added inline comments.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
488	It should be pre-order since we need to handle cases without bitcasts, such as, amx-low-intrinsics-no-bitcast.ll

Harbormaster completed remote builds in B90331: Diff 325670.Feb 22 2021, 10:47 PM

Fix some comments and commit message

yubing marked an inline comment as done.Feb 23 2021, 7:19 PM

yubing edited the summary of this revision.Feb 23 2021, 7:22 PM

Harbormaster completed remote builds in B90524: Diff 325964.Feb 23 2021, 9:59 PM

pengfei added inline comments.Feb 24 2021, 5:03 AM

llvm/include/llvm/CodeGen/Passes.h
496	transforms
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2	We usually comment as ===--- filename - description ---=== See `head -n1 llvm/lib/Target/X86/*.cpp`
52	Ctx
89	Can we just use `template <bool IsLoad>`? I think it also can reduce the branch.
100	Not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM intrinsics `llvm.masked.load/store` to reduce the inner loop.
167	Maybe we can just use cast to help to raise the assertion.
224	You can use cast to help to check the failure so that VecA/B/C won't be uninitialized.
230	ditto
232	Should check it is V256I32?
233	ditto
289	eltc?
312	Is it necessary to insert the ResElt to VecC?
341	TileLoadStore
342	Forgot to remove?
344	ditto
388	ditto
392	ditto
llvm/lib/Target/X86/X86LowerAMXType.cpp
333 ↗	(On Diff #325964)	ditto
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
1 ↗	(On Diff #325964)	Better name it amx-low-intrinsics-no-amx-bitcast.ll
13 ↗	(On Diff #325964)	It seems the body block is not necessary
19 ↗	(On Diff #325964)	ditto. The label `TILELOAD_SCALARIZE_COLS_BODY` is not even used.
31 ↗	(On Diff #325964)	I think cols.latch is not necessary either.

yubing added inline comments.Feb 24 2021, 6:50 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	Yes, it is necessary since you should use the updated eltC (aka Cij) when you are doing the matrix dot product: Cij = Cij + Ai1*B1j; Cij = Cij + Ai2*B2j; ...; Cij = Cij + AiK*BKj

pengfei added inline comments.Feb 24 2021, 7:37 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	But you don't need to update both C and D. Something like this pseudo code should be enough: for (k : K) Dij += Aik * Bkj; Dij += Cij
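
The two accumulation orders discussed in this thread can be compared in a small scalar model. This is a simplified sketch: each tile element is treated as a single 32-bit integer (the real tdpbssd multiplies packed signed bytes), tiles are modeled as 16x16 arrays, and `dotUpdateC` and `dotLocalAcc` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kDim = 16; // assumed 16x16 tile of 32-bit elements
using Tile = std::array<std::int32_t, kDim * kDim>;

// First form: update Cij in place on every k step.
Tile dotUpdateC(std::size_t M, std::size_t N, std::size_t K, Tile C,
                const Tile &A, const Tile &B) {
  assert(M <= kDim && N <= kDim && K <= kDim);
  for (std::size_t I = 0; I < M; ++I)
    for (std::size_t J = 0; J < N; ++J)
      for (std::size_t Kk = 0; Kk < K; ++Kk)
        C[I * kDim + J] += A[I * kDim + Kk] * B[Kk * kDim + J];
  return C;
}

// Second form: accumulate into a scalar Dij, then add Cij once at the end.
Tile dotLocalAcc(std::size_t M, std::size_t N, std::size_t K, const Tile &C,
                 const Tile &A, const Tile &B) {
  assert(M <= kDim && N <= kDim && K <= kDim);
  Tile D = C; // elements outside the M x N region are copied unchanged
  for (std::size_t I = 0; I < M; ++I)
    for (std::size_t J = 0; J < N; ++J) {
      std::int32_t Dij = 0;
      for (std::size_t Kk = 0; Kk < K; ++Kk)
        Dij += A[I * kDim + Kk] * B[Kk * kDim + J];
      D[I * kDim + J] = Dij + C[I * kDim + J];
    }
  return D;
}
```

Both forms compute the same result; the second keeps only one running value per (i, j) inside the inner loop.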

LuoYuanke added inline comments.Feb 27 2021, 4:49 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	Why do we need a template instead of passing a parameter `bool IsLoad`?

pengfei added inline comments.Feb 27 2021, 5:36 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	Bing thought template instantiation could keep the condition from turning into branch instructions.

LuoYuanke added inline comments.Feb 27 2021, 5:40 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	It is arguable which benefits more: saving code size or avoiding branch instructions. :)
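
The trade-off in this thread can be shown with two versions of the same helper. This is a generic sketch, not code from the patch: `moveRowTemplated` and `moveRowParam` are hypothetical names, and whether the templated form actually removes a branch depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Template form: IsLoad is a compile-time constant, so an optimizing compiler
// can drop the untaken direction from each instantiation.
template <bool IsLoad>
void moveRowTemplated(std::int32_t *TileRow, std::int32_t *MemRow,
                      std::size_t N) {
  for (std::size_t I = 0; I < N; ++I) {
    if (IsLoad)
      TileRow[I] = MemRow[I]; // load: memory -> tile
    else
      MemRow[I] = TileRow[I]; // store: tile -> memory
  }
}

// Parameter form: a single function body, direction chosen at run time.
void moveRowParam(bool IsLoad, std::int32_t *TileRow, std::int32_t *MemRow,
                  std::size_t N) {
  for (std::size_t I = 0; I < N; ++I) {
    if (IsLoad)
      TileRow[I] = MemRow[I];
    else
      MemRow[I] = TileRow[I];
  }
}

// Usage: both forms copy in the requested direction.
void exampleMoveRow() {
  std::int32_t TileRow[2] = {0, 0};
  std::int32_t MemRow[2] = {5, 6};
  moveRowTemplated<true>(TileRow, MemRow, 2);
  assert(TileRow[0] == 5 && TileRow[1] == 6);
}
```

The template produces two function bodies (larger code), while the parameter form produces one body with a run-time check, which is the code-size versus branch trade-off being debated.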

yubing added inline comments.Feb 28 2021, 9:09 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
100	I think we can write a follow-up patch for this optimization.

address comments above

yubing marked 15 inline comments as done.Mar 1 2021, 11:21 PM

yubing added inline comments.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	I changed the code into the following style, which can also reduce the inner loop's size: for (k : K) Cij += Aik * Bkj; Dij = Cij. Besides, I hoisted the calculation of (i,j)'s linear index above the inner loops.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
13 ↗	(On Diff #325964)	In fact, ISEL PASS can merge basicblocks together.

LuoYuanke added inline comments.Mar 2 2021, 1:24 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instruction for vector C.

Harbormaster completed remote builds in B91505: Diff 327362.Mar 2 2021, 2:04 AM

yubing added inline comments.Mar 2 2021, 7:30 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	But your solution still needs to update D, so D's phi will be kept in the inner loops.

LGTM with some nitpicks 😊

llvm/include/llvm/CodeGen/Passes.h
496	transforms
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
10	operations
11	We always enable it. Also need to mention optnone.
47	Curly brackets are not necessary here.
83	Do we need to remove the successor? Isn't it still being dominated?
97	IsTileLoad
118	Use 1 directly?
122	Better use the same naming convention, i.e. `ColLoopHeader`
126–127	Better to change the order, e.g. Type *EltTy = B.getInt32Ty(); FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256);
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
1 ↗	(On Diff #327362)	I think we should move the files to llvm/test/Transforms/
llvm/test/CodeGen/X86/AMX/amx-type.ll
2 ↗	(On Diff #327362)	Why adding this? Is it O2 by default?

This revision is now accepted and ready to land.Mar 3 2021, 5:52 AM

LuoYuanke added inline comments.Mar 4 2021, 5:09 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
83	I think this is to remove edge from preheader to tmp, because we insert a loop between them.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
1 ↗	(On Diff #327362)	Not sure about it. Our .cpp code is under lib/Target/X86/ folder.
llvm/test/CodeGen/X86/AMX/amx-type.ll
2 ↗	(On Diff #327362)	I think this is to test that this pass does nothing at opt level 2.

Address pengfei's comments

LGTM too.

This revision was landed with ongoing or failed builds.Mar 5 2021, 12:02 AM

Closed by commit rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation. (authored by LuoYuanke, committed by yubing).

This revision was automatically updated to reflect the committed changes.

yubing added a commit: rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation..

This seems to break the build https://buildkite.com/mlir/mlir-core/builds/12026#91ec4dfe-542f-4312-92db-7d555f05ce06.

I could repro locally, reverting locally fixes the build.

Please address, thanks!

RKSimon added a reverting change: rG3fd2fa122059: Revert rG8198d83965ba4b9db6922b44ef3041030b2bac39: "[X86] Pass to transform amx….Mar 5 2021, 3:09 AM

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

In D93594#2606157, @RKSimon wrote:

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Thanks – I was just about to point out this broke downstream testing too.

Thanks all for reporting and reverting this.

Harbormaster completed remote builds in B92239: Diff 328408.Mar 5 2021, 4:55 PM

Thanks all for reporting and reverting this. I will do bugfix asap.

In D93594#2606157, @RKSimon wrote:

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Hi @RKSimon @nicolasvasilache, it seems we haven't told libLLVMX86CodeGen.so.13git to link TransformUtils in llvm/lib/Target/X86/CMakeLists.txt. That's why we encountered the build failure.
But there is a strange thing that can be observed in build.ninja:
When I run cmake with "-DBUILD_SHARED_LIBS=OFF", libLLVMX86CodeGen.a still links lib/libLLVMTransformUtils.a.
When I run cmake with "-DBUILD_SHARED_LIBS=ON", libLLVMX86CodeGen.so.13git doesn't link TransformUtils.
Is there any difference in the build system between static and shared libraries?

yubing reopened this revision.Mar 8 2021, 7:30 PM

This revision is now accepted and ready to land.Mar 8 2021, 7:30 PM

Fix the build failure with -DBUILD_SHARED_LIBS=ON

Harbormaster completed remote builds in B92791: Diff 329204.Mar 9 2021, 4:01 AM

This revision was landed with ongoing or failed builds.Mar 15 2021, 7:41 PM

Closed by commit rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation. (authored by yubing).

This revision was automatically updated to reflect the committed changes.

yubing added a commit: rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation..

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).
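
The suggested fix, building the analyses only once an AMX intrinsic is actually found, can be sketched with stand-in types. Everything here is hypothetical (`Instruction`, `Function`, `ExpensiveAnalysis`, `runScalarizePass`); it models the idea of lazy computation and is not the code from the follow-up patch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are LLVM classes.
struct Instruction {
  std::string Opcode;
  bool IsAMXIntrinsic;
};
using Function = std::vector<Instruction>;

// Models a costly analysis such as a dominator tree or loop info.
struct ExpensiveAnalysis {
  ExpensiveAnalysis(const Function &, int &BuildCount) { ++BuildCount; }
};

// Returns true if anything was rewritten. The analysis is constructed lazily,
// at most once, and only when the first AMX intrinsic is seen.
bool runScalarizePass(const Function &F, int &AnalysisBuilds) {
  std::unique_ptr<ExpensiveAnalysis> Analysis;
  bool Changed = false;
  for (const Instruction &I : F) {
    if (!I.IsAMXIntrinsic)
      continue;
    if (!Analysis)
      Analysis.reset(new ExpensiveAnalysis(F, AnalysisBuilds));
    Changed = true; // a real pass would rewrite the intrinsic here
  }
  return Changed;
}

// Usage: a function without AMX intrinsics never pays for the analysis.
void exampleLazyAnalysis() {
  int Builds = 0;
  Function Plain{Instruction{"add", false}};
  assert(!runScalarizePass(Plain, Builds));
  assert(Builds == 0);
}
```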

In D93594#2628497, @nikic wrote:

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).

Thanks, @nikic, I will fix it ASAP. Besides, how can I reproduce the regression?
I am asking because I want to check that the regression can no longer be reproduced with my upcoming fix.

@yubing In this case I would recommend building sqlite3.c from test-suite under perf stat and looking at the instructions metric. For me the command looks like this:

perf stat CLANG_BINARY   -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -I. -MD -MT MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -MF MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o.d -o MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o   -c ../MultiSource/Applications/sqlite3/sqlite3.c

You can generally get a build command using ninja -v sqlite3 in test-suite.

I can reproduce the regression. I'll help to fix it.

LuoYuanke mentioned this in D98773: [X86] Fix compile time regression of D93594..Mar 17 2021, 4:24 AM

The fix is uploaded at https://reviews.llvm.org/D98773.

davezarzycki removed a subscriber: davezarzycki.Mar 17 2021, 6:46 AM

LuoYuanke mentioned this in rGe64adc0b88c2: [X86] Fix compile time regression of D93594..Mar 18 2021, 1:53 AM

yubing mentioned this in rG113f077f808f: [X86] Pass to transform tdpbf16ps intrinsics to scalar operation..Mar 21 2021, 10:01 PM

Revision Contents

	Path	Size
	clang/lib/Headers/amxintrin.h	2 lines
	llvm/include/llvm/CodeGen/Passes.h	2 lines
	llvm/lib/Target/X86/CMakeLists.txt	1 line
	llvm/lib/Target/X86/X86.h	1 line
	llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp	531 lines
	llvm/lib/Target/X86/X86TargetMachine.cpp	8 lines
	llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll	198 lines
	llvm/test/CodeGen/X86/O0-pipeline.ll	4 lines

Diff 319797

clang/lib/Headers/amxintrin.h

	/*===--------------- amxintrin.h - AMX intrinsics -*- C/C++ -*---------------===			/*===--------------- amxintrin.h - AMX intrinsics -*- C/C++ -*---------------===
	*			*
	* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	* See https://llvm.org/LICENSE.txt for license information.			* See https://llvm.org/LICENSE.txt for license information.
	* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	*			*
	*===------------------------------------------------------------------------===			*===------------------------------------------------------------------------===
	*/			*/

	#ifndef __IMMINTRIN_H			#ifndef __IMMINTRIN_H
	#error "Never use <amxintrin.h> directly; include <immintrin.h> instead."			#error "Never use <amxintrin.h> directly; include <immintrin.h> instead."
	#endif /* __IMMINTRIN_H */			#endif /* __IMMINTRIN_H */

	#ifndef __AMXINTRIN_H			#ifndef __AMXINTRIN_H
	#define __AMXINTRIN_H			#define __AMXINTRIN_H
	#ifdef __x86_64__			#ifdef __x86_64__

	#define __DEFAULT_FN_ATTRS_TILE \			#define __DEFAULT_FN_ATTRS_TILE \
	__attribute__((__always_inline__, __nodebug__, __target__("amx-tile")))			__attribute__((__always_inline__, __nodebug__, __target__("amx-tile")))

	/// Load tile configuration from a 64-byte memory location specified by			/// Load tile configuration from a 64-byte memory location specified by
	/// "mem_addr". The tile configuration includes the tile type palette, the			/// "mem_addr". The tile configuration includes the tile type palette, the
	/// number of bytes per row, and the number of rows. If the specified			/// number of bytes per row, and the number of rows. If the specified
	/// palette_id is zero, that signifies the init state for both the tile			/// palette_id is zero, that signifies the init state for both the tile
	/// config and the tile data, and the tiles are zeroed. Any invalid			/// config and the tile data, and the tiles are zeroed. Any invalid
	/// configurations will result in #GP fault.			/// configurations will result in #GP fault.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> LDTILECFG </c> instruction.			/// This intrinsic corresponds to the <c> LDTILECFG </c> instruction.
	///			///
	/// \param __config			/// \param __config
	/// A pointer to 512-bits configuration			/// A pointer to 512-bits configuration
	static __inline__ void __DEFAULT_FN_ATTRS_TILE			static __inline__ void __DEFAULT_FN_ATTRS_TILE
	_tile_loadconfig(const void *__config) {			_tile_loadconfig(const void *__config) {
	__builtin_ia32_tile_loadconfig(__config);			__builtin_ia32_tile_loadconfig(__config);
	}			}

	/// Stores the current tile configuration to a 64-byte memory location			/// Stores the current tile configuration to a 64-byte memory location
	/// specified by "mem_addr". The tile configuration includes the tile type			/// specified by "mem_addr". The tile configuration includes the tile type
	/// palette, the number of bytes per row, and the number of rows. If tiles			/// palette, the number of bytes per row, and the number of rows. If tiles
	/// are not configured, all zeroes will be stored to memory.			/// are not configured, all zeroes will be stored to memory.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> STTILECFG </c> instruction.			/// This intrinsic corresponds to the <c> STTILECFG </c> instruction.
	///			///
	/// \param __config			/// \param __config
	/// A pointer to 512-bits configuration			/// A pointer to 512-bits configuration
	static __inline__ void __DEFAULT_FN_ATTRS_TILE			static __inline__ void __DEFAULT_FN_ATTRS_TILE
	_tile_storeconfig(void *__config) {			_tile_storeconfig(void *__config) {
	__builtin_ia32_tile_storeconfig(__config);			__builtin_ia32_tile_storeconfig(__config);
	}			}

	/// Release the tile configuration to return to the init state, which			/// Release the tile configuration to return to the init state, which
	/// releases all storage it currently holds.			/// releases all storage it currently holds.
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	/// This intrinsic corresponds to the <c> TILERELEASE </c> instruction.			/// This intrinsic corresponds to the <c> TILERELEASE </c> instruction.
	static __inline__ void __DEFAULT_FN_ATTRS_TILE _tile_release(void) {			static __inline__ void __DEFAULT_FN_ATTRS_TILE _tile_release(void) {
	__builtin_ia32_tilerelease();			__builtin_ia32_tilerelease();
	}			}

	/// Load tile rows from memory specifieid by "base" address and "stride" into			/// Load tile rows from memory specifieid by "base" address and "stride" into
	/// destination tile "dst" using the tile configuration previously configured			/// destination tile "dst" using the tile configuration previously configured
	/// via "_tile_loadconfig".			/// via "_tile_loadconfig".
	///			///
	/// \headerfile <x86intrin.h>			/// \headerfile <x86intrin.h>
	///			///
	▲ Show 20 Lines • Show All 152 Lines • ▼ Show 20 Lines

	#define __DEFAULT_FN_ATTRS_INT8 \			#define __DEFAULT_FN_ATTRS_INT8 \
	__attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))			__attribute__((__always_inline__, __nodebug__, __target__("amx-int8")))

	typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));			typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));
	static __inline__ _tile1024i __DEFAULT_FN_ATTRS_INT8			static __inline__ _tile1024i __DEFAULT_FN_ATTRS_INT8
	_tile_loadd_internal(unsigned short m, unsigned short n, const void *base,			_tile_loadd_internal(unsigned short m, unsigned short n, const void *base,
	__SIZE_TYPE__ stride) {			__SIZE_TYPE__ stride) {
	return __builtin_ia32_tileloadd64_internal(m, n, base,			return __builtin_ia32_tileloadd64_internal(m, n, base,
	(__SIZE_TYPE__)(stride));			(__SIZE_TYPE__)(stride));
	}			}

	static __inline__ _tile1024i __DEFAULT_FN_ATTRS_INT8			static __inline__ _tile1024i __DEFAULT_FN_ATTRS_INT8
	_tile_dpbssd_internal(unsigned short m, unsigned short n, unsigned short k,			_tile_dpbssd_internal(unsigned short m, unsigned short n, unsigned short k,
	_tile1024i dst, _tile1024i src1, _tile1024i src2) {			_tile1024i dst, _tile1024i src1, _tile1024i src2) {
	return __builtin_ia32_tdpbssd_internal(m, n, k, dst, src1, src2);			return __builtin_ia32_tdpbssd_internal(m, n, k, dst, src1, src2);
	}			}

	static __inline__ void __DEFAULT_FN_ATTRS_INT8			static __inline__ void __DEFAULT_FN_ATTRS_INT8
	_tile_stored_internal(unsigned short m, unsigned short n, void *base,			_tile_stored_internal(unsigned short m, unsigned short n, void *base,
	__SIZE_TYPE__ stride, _tile1024i tile) {			__SIZE_TYPE__ stride, _tile1024i tile) {
	return __builtin_ia32_tilestored64_internal(m, n, base,			return __builtin_ia32_tilestored64_internal(m, n, base,
	(__SIZE_TYPE__)(stride), tile);			(__SIZE_TYPE__)(stride), tile);
	}			}

	typedef struct __tile1024i_str {			typedef struct __tile1024i_str {
	const unsigned short row;			const unsigned short row;
	const unsigned short col;			const unsigned short col;
	_tile1024i tile;			_tile1024i tile;
	} __tile1024i;			} __tile1024i;

	__DEFAULT_FN_ATTRS_TILE			__DEFAULT_FN_ATTRS_TILE
	static void __tile_loadd(__tile1024i *dst, const void *base,			static void __tile_loadd(__tile1024i *dst, const void *base,
	__SIZE_TYPE__ stride) {			__SIZE_TYPE__ stride) {
	dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);			dst->tile = _tile_loadd_internal(dst->row, dst->col, base, stride);
	}			}

	__DEFAULT_FN_ATTRS_INT8			__DEFAULT_FN_ATTRS_INT8
	static void __tile_dpbsud(__tile1024i *dst, __tile1024i src1,			static void __tile_dpbssd(__tile1024i *dst, __tile1024i src1,
	__tile1024i src2) {			__tile1024i src2) {
	dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile,			dst->tile = _tile_dpbssd_internal(src1.row, src2.col, src1.col, dst->tile,
	src1.tile, src2.tile);			src1.tile, src2.tile);
	}			}

	__DEFAULT_FN_ATTRS_TILE			__DEFAULT_FN_ATTRS_TILE
	static void __tile_stored(void *base, __SIZE_TYPE__ stride, __tile1024i src) {			static void __tile_stored(void *base, __SIZE_TYPE__ stride, __tile1024i src) {
	_tile_stored_internal(src.row, src.col, base, stride, src.tile);			_tile_stored_internal(src.row, src.col, base, stride, src.tile);
	}			}

	__DEFAULT_FN_ATTRS_TILE			__DEFAULT_FN_ATTRS_TILE
	static void __tile_zero(__tile1024i *dst) {			static void __tile_zero(__tile1024i *dst) {
	dst->tile = __builtin_ia32_tilezero_internal(dst->row, dst->col);			dst->tile = __builtin_ia32_tilezero_internal(dst->row, dst->col);
	}			}

	#endif /* __x86_64__ */			#endif /* __x86_64__ */
	#endif /* __AMXINTRIN_H */			#endif /* __AMXINTRIN_H */

llvm/include/llvm/CodeGen/Passes.h

Show First 20 Lines • Show All 486 Lines • ▼ Show 20 Lines	/// MachineDominanaceFrontier - This pass is a machine dominators analysis pass.

/// The pass fixups statepoint machine instruction to replace usage of		/// The pass fixups statepoint machine instruction to replace usage of
/// caller saved registers with stack slots.		/// caller saved registers with stack slots.
extern char &FixupStatepointCallerSavedID;		extern char &FixupStatepointCallerSavedID;

/// The pass transform load/store <256 x i32> to AMX load/store intrinsics		/// The pass transform load/store <256 x i32> to AMX load/store intrinsics
/// or split the data to two <128 x i32>.		/// or split the data to two <128 x i32>.
FunctionPass *createX86LowerAMXTypePass();		FunctionPass *createX86LowerAMXTypePass();

		FunctionPass *createX86LowerAMXIntrinsicsPass();
} // End llvm namespace		} // End llvm namespace

#endif		#endif

llvm/lib/Target/X86/CMakeLists.txt

Show All 27 Lines	set(sources
X86AvoidTrailingCall.cpp		X86AvoidTrailingCall.cpp
X86CallFrameOptimization.cpp		X86CallFrameOptimization.cpp
X86CallingConv.cpp		X86CallingConv.cpp
X86CallLowering.cpp		X86CallLowering.cpp
X86CmovConversion.cpp		X86CmovConversion.cpp
X86DomainReassignment.cpp		X86DomainReassignment.cpp
X86DiscriminateMemOps.cpp		X86DiscriminateMemOps.cpp
X86LowerAMXType.cpp		X86LowerAMXType.cpp
		X86LowerAMXIntrinsics.cpp
X86TileConfig.cpp		X86TileConfig.cpp
X86PreTileConfig.cpp		X86PreTileConfig.cpp
X86ExpandPseudo.cpp		X86ExpandPseudo.cpp
X86FastISel.cpp		X86FastISel.cpp
X86FixupBWInsts.cpp		X86FixupBWInsts.cpp
X86FixupLEAs.cpp		X86FixupLEAs.cpp
X86AvoidStoreForwardingBlocks.cpp		X86AvoidStoreForwardingBlocks.cpp
X86FixupSetCC.cpp		X86FixupSetCC.cpp
▲ Show 20 Lines • Show All 64 Lines • Show Last 20 Lines

llvm/lib/Target/X86/X86.h

	Show First 20 Lines • Show All 163 Lines • ▼ Show 20 Lines
	void initializeX86LoadValueInjectionRetHardeningPassPass(PassRegistry &);			void initializeX86LoadValueInjectionRetHardeningPassPass(PassRegistry &);
	void initializeX86OptimizeLEAPassPass(PassRegistry &);			void initializeX86OptimizeLEAPassPass(PassRegistry &);
	void initializeX86PartialReductionPass(PassRegistry &);			void initializeX86PartialReductionPass(PassRegistry &);
	void initializeX86SpeculativeLoadHardeningPassPass(PassRegistry &);			void initializeX86SpeculativeLoadHardeningPassPass(PassRegistry &);
	void initializeX86SpeculativeExecutionSideEffectSuppressionPass(PassRegistry &);			void initializeX86SpeculativeExecutionSideEffectSuppressionPass(PassRegistry &);
	void initializeX86PreTileConfigPass(PassRegistry &);			void initializeX86PreTileConfigPass(PassRegistry &);
	void initializeX86TileConfigPass(PassRegistry &);			void initializeX86TileConfigPass(PassRegistry &);
	void initializeX86LowerAMXTypeLegacyPassPass(PassRegistry &);			void initializeX86LowerAMXTypeLegacyPassPass(PassRegistry &);
				void initializeX86LowerAMXIntrinsicsLegacyPassPass(PassRegistry &);

	namespace X86AS {			namespace X86AS {
	enum : unsigned {			enum : unsigned {
	GS = 256,			GS = 256,
	FS = 257,			FS = 257,
	SS = 258,			SS = 258,
	PTR32_SPTR = 270,			PTR32_SPTR = 270,
	PTR32_UPTR = 271,			PTR32_UPTR = 271,
	PTR64 = 272			PTR64 = 272
	};			};
	} // End X86AS namespace			} // End X86AS namespace

	} // End llvm namespace			} // End llvm namespace

	#endif			#endif

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp

This file was added.

				//===- llvm/CodeGen/TileShapeInfo.h - ---------------------------- C++ --===//
				//
				LuoYuanke: This seems wrong file name.
				pengfei: We usually comment as `//===--- filename - description ---===//`. See `head -n1 llvm/lib/Target/X86/*.cpp`.
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file Pass to transform amx intrinsics to scalar operation.
				/// This pass is only enabled with -O0. With -O0, the def of shape to amx
				pengfei: operations
				/// intrinsics is near the amx intrinsics code. We are not bale to find a
				pengfei: We always enable it. Also need to mention optnone.
				/// point which post-dominate all the shape and dominate all amx intrinsics.
				LuoYuanke: Typo: 'able'.
				/// To decouple the dependency of the shape, we transform amx intrinsics
				/// to scalar operation, so that compiling doesn't fail. In long term, we
				/// should improve fast register allocation to allocate amx register.
				//===----------------------------------------------------------------------===//
				//
				#include "X86.h"
				#include "llvm/ADT/DenseSet.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/Analysis/DomTreeUpdater.h"
				#include "llvm/Analysis/OptimizationRemarkEmitter.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/CodeGen/ValueTypes.h"
				#include "llvm/IR/DataLayout.h"
				#include "llvm/IR/Function.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/IntrinsicsX86.h"
				#include "llvm/IR/PatternMatch.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Pass.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/LoopUtils.h"

				using namespace llvm;
				using namespace PatternMatch;

				#define DEBUG_TYPE "lower-amx-intrinsics"

				static BasicBlock *createLoop(BasicBlock *Preheader, BasicBlock *Exit,
				Value *Bound, Value *Step, StringRef Name,
				IRBuilderBase &B, DomTreeUpdater &DTU, Loop *L,
				LoopInfo &LI) {
				LLVMContext &Ctx = Preheader->getContext();
				pengfei: Curly brackets are not necessary here.
				BasicBlock *Header = BasicBlock::Create(
				Preheader->getContext(), Name + ".header", Preheader->getParent(), Exit);
				BasicBlock *Body = BasicBlock::Create(Header->getContext(), Name + ".body",
				Header->getParent(), Exit);
				BasicBlock *Latch = BasicBlock::Create(Header->getContext(), Name + ".latch",
				pengfei: Ctx
				Header->getParent(), Exit);

				Type *I16Ty = Type::getInt16Ty(Ctx);
				BranchInst::Create(Body, Header);
				BranchInst::Create(Latch, Body);
				PHINode *IV =
				PHINode::Create(I16Ty, 2, Name + ".iv", Header->getTerminator());
				IV->addIncoming(ConstantInt::get(I16Ty, 0), Preheader);

				B.SetInsertPoint(Latch);
				Value *Inc = B.CreateAdd(IV, Step, Name + ".step");
				Value *Cond = B.CreateICmpNE(Inc, Bound, Name + ".cond");
				BranchInst::Create(Header, Exit, Cond, Latch);
				IV->addIncoming(Inc, Latch);

				BranchInst *PreheaderBr = cast<BranchInst>(Preheader->getTerminator());
				BasicBlock *Tmp = PreheaderBr->getSuccessor(0);
				PreheaderBr->setSuccessor(0, Header);
				DTU.applyUpdatesPermissive({
				{DominatorTree::Delete, Preheader, Tmp},
				{DominatorTree::Insert, Header, Body},
				{DominatorTree::Insert, Body, Latch},
				{DominatorTree::Insert, Latch, Header},
				{DominatorTree::Insert, Latch, Exit},
				{DominatorTree::Insert, Preheader, Header},
				});

				L->addBasicBlockToLoop(Header, LI);
				L->addBasicBlockToLoop(Body, LI);
				L->addBasicBlockToLoop(Latch, LI);
				return Body;
				pengfei: Do we need to remove the successor? Isn't it still being dominated?
				LuoYuanke: I think this is to remove the edge from Preheader to Tmp, because we insert a loop between them.
				}
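
As a reading aid, the control flow that `createLoop` builds can be modeled in plain C++. This is a minimal sketch and not part of the patch; `countIterations` is a hypothetical name. It assumes only what the code above shows: a 16-bit induction variable starting at 0, an `add` in the latch, and an `icmp ne` exit test, so the body runs before the bound is checked.

```cpp
#include <cstdint>

// Hypothetical scalar model of the header/body/latch structure above.
// Returns how many times the body block would execute.
std::uint32_t countIterations(std::uint16_t Bound, std::uint16_t Step) {
  std::uint32_t Trips = 0;
  std::uint16_t IV = 0; // PHI in the header, incoming 0 from the preheader.
  while (true) {
    ++Trips; // The body block runs here.
    // Latch: %step = add i16 %iv, Step (wraps at 16 bits).
    std::uint16_t Inc = static_cast<std::uint16_t>(IV + Step);
    if (Inc == Bound) // Latch: icmp ne is false, branch to the exit block.
      break;
    IV = Inc; // Back edge: the PHI takes Inc from the latch.
  }
  return Trips;
}
```

Under this model a bound of 0 does not skip the loop: the body runs once, and the 16-bit counter then has to wrap around before the exit test succeeds.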

				static Value *createTileLoadLoops(BasicBlock *Start, BasicBlock *End,
				IRBuilderBase &B, DomTreeUpdater &DTU,
				LoopInfo &LI, Value *Row, Value *Col,
				Value *Ptr, Value *Stride) {
				pengfei: Can we just use `template <bool IsLoad>`? I think it can also reduce the branch.
				LuoYuanke: Why do we need a template instead of passing a parameter `bool IsLoad`?
				pengfei: Bing thought template instantiation can keep the condition code from turning into branch instructions.
				LuoYuanke: It is arguable which benefits more: saving code size or avoiding branch instructions. :)
				Loop *RowLoop = LI.AllocateLoop();
				Loop *ColLoop = LI.AllocateLoop();
				RowLoop->addChildLoop(ColLoop);
				if (Loop *ParentL = LI.getLoopFor(Start))
				ParentL->addChildLoop(RowLoop);
				else
				LI.addTopLevelLoop(RowLoop);

				pengfei: IsTileLoad
				BasicBlock *RowBody = createLoop(Start, End, Row, B.getInt16(1),
				"tileload.unroll.rows", B, DTU, RowLoop, LI);
				BasicBlock *RowLatch = RowBody->getSingleSuccessor();
				pengfei: Not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM intrinsics `llvm.masked.load/store` to reduce the inner loop.
				yubing (author): I think we can compose a follow-up patch for this optimization.

				// uint16_t ColStep = B.getInt32Ty()->getPrimitiveSizeInBits() / 8;
				uint16_t ColStep = 1;
				BasicBlock *ColBody = createLoop(RowBody, RowLatch, Col, B.getInt16(ColStep),
				"tileload.unroll.cols", B, DTU, ColLoop, LI);

				BasicBlock *ColLoopLatch = ColBody->getSingleSuccessor();
				BasicBlock *ColumnLoopHeader = ColBody->getSinglePredecessor();
				BasicBlock *RowLoopHeader = RowBody->getSinglePredecessor();
				Value *CurrentRow = &*RowLoopHeader->begin();
				Value *CurrentCol = &*ColumnLoopHeader->begin();

				// tileload.unroll.rows.header:
				// %vec.phi.row = phi <256 x i32> [ zeroinitializer, %entry ], [ %40,
				// %tileload.unroll.rows.latch ]
				B.SetInsertPoint(RowLoopHeader->getTerminator());
				FixedVectorType *V256I32Ty = FixedVectorType::get(B.getInt32Ty(), 256);
				Value *VecZero = Constant::getNullValue(V256I32Ty);
				pengfei: Use 1 directly?
				PHINode *VecPhi_Row_Loop = B.CreatePHI(V256I32Ty, 2, "vec.phi.row");
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'VecPhi_Row_Loop' [readability-identifier-naming]
				VecPhi_Row_Loop->addIncoming(VecZero, Start);

				// tileload.unroll.cols.header:
				pengfei: Better to use the same naming convention, i.e. `ColLoopHeader`.
				// %vec.phi = phi <256 x i32> [ %vec.phi.row, %tileload.unroll.rows.body ], [
				// %40, %tileload.unroll.cols.latch ]
				B.SetInsertPoint(ColumnLoopHeader->getTerminator());
				// Value *UndefVec = UndefValue::get(V256I32Ty);
				PHINode *VecPhi = B.CreatePHI(V256I32Ty, 2, "vec.phi");
				pengfei: Better to change the order, e.g. `Type *EltTy = B.getInt32Ty(); FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256);`
				VecPhi->addIncoming(VecPhi_Row_Loop, RowBody);

				// tileload.unroll.cols.body:
				// %elt = load i32 i32 *ptr
				// %mul = mul i16 %row.iv, i16 16
				// %add = add i16 %mul, i16 %col.iv
				// %vec2 = insertelement <16 x i32> %vecphi, i32 %elt, i16 %idx
				B.SetInsertPoint(ColBody->getTerminator());
				Type *EltTy = V256I32Ty->getElementType();
				Value *CurrentRowZExt = B.CreateZExt(CurrentRow, Stride->getType());
				Value *CurrentColZExt = B.CreateZExt(CurrentCol, Stride->getType());
				Value *Offset =
				B.CreateAdd(B.CreateMul(CurrentRowZExt, Stride), CurrentColZExt);
				unsigned AS = cast<PointerType>(Ptr->getType())->getAddressSpace();
				Value *EltBasePtr = B.CreatePointerCast(Ptr, PointerType::get(EltTy, AS));
				Value *EltPtr = B.CreateGEP(EltTy, EltBasePtr, Offset);
				Value *Elt = B.CreateLoad(EltTy, EltPtr);
				Value *Idx = B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentCol);
				Value *ResVec = B.CreateInsertElement(VecPhi, Elt, Idx);
				VecPhi->addIncoming(ResVec, ColLoopLatch);
				VecPhi_Row_Loop->addIncoming(ResVec, RowLatch);

				return ResVec;
				}
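
The row and column loops generated above can be read as the following scalar routine. This is a hedged sketch rather than code from the patch: `tileLoadModel` and its parameter names are hypothetical, and it assumes a non-empty tile (the generated loops test their bound only after the first iteration). Column count and stride are given in bytes and divided by 4, matching the `udiv` by 4 that `lowerTileLoad` emits before calling `createTileLoadLoops`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the lowered tile load.
// Rows: number of tile rows; ColBytes and StrideBytes are in bytes.
std::array<std::int32_t, 256> tileLoadModel(const std::int32_t *Ptr,
                                            std::uint16_t Rows,
                                            std::uint16_t ColBytes,
                                            std::uint64_t StrideBytes) {
  assert(Rows <= 16 && ColBytes <= 64 && "tile is at most 16 rows x 64 bytes");
  std::array<std::int32_t, 256> Vec{}; // Starts as zeroinitializer.
  const std::size_t Cols = ColBytes / 4;      // Columns in dwords.
  const std::size_t Stride = StrideBytes / 4; // Row stride in dwords.
  for (std::size_t Row = 0; Row < Rows; ++Row)
    for (std::size_t Col = 0; Col < Cols; ++Col)
      // Offset = Row * Stride + Col; Idx = Row * 16 + Col.
      Vec[Row * 16 + Col] = Ptr[Row * Stride + Col];
  return Vec;
}
```

Elements outside the `Rows` x `Cols` region keep their initial zero value, which mirrors the `zeroinitializer` incoming value of the row PHI.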

				static void createTileStoreLoops(BasicBlock *Start, BasicBlock *End,
				IRBuilderBase &B, DomTreeUpdater &DTU,
				LoopInfo &LI, Value *Row, Value *Col,
				Value *Ptr, Value *Stride, Value *Tile) {
				Loop *RowLoop = LI.AllocateLoop();
				Loop *ColLoop = LI.AllocateLoop();
				RowLoop->addChildLoop(ColLoop);
				if (Loop *ParentL = LI.getLoopFor(Start))
				LuoYuanke: Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops, so that it can be used by both and some other functions.
				ParentL->addChildLoop(RowLoop);
				else
				LI.addTopLevelLoop(RowLoop);

				BasicBlock *RowBody =
				createLoop(Start, End, Row, B.getInt16(1), "tilestore.unroll.rows", B,
				DTU, RowLoop, LI);
				pengfei: Maybe we can just use cast to help to raise the assertion.
				BasicBlock *RowLatch = RowBody->getSingleSuccessor();

				uint16_t ColStep = 1;
				BasicBlock *ColBody =
				createLoop(RowBody, RowLatch, Col, B.getInt16(ColStep),
				"tilestore.unroll.cols", B, DTU, ColLoop, LI);

				BasicBlock *ColumnLoopHeader = ColBody->getSinglePredecessor();
				BasicBlock *RowLoopHeader = RowBody->getSinglePredecessor();
				Value *CurrentRow = &*RowLoopHeader->begin();
				Value *CurrentCol = &*ColumnLoopHeader->begin();

				Value *Vec = nullptr;
				if (auto BitCast = dyn_cast<BitCastInst>(Tile))
				Lint (pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]
				Vec = BitCast->getOperand(0);
				assert(Vec && Vec->getType()->isVectorTy() &&
				"bitcast from non-v256i32 to x86amx");

				B.SetInsertPoint(ColumnLoopHeader->getTerminator());
				FixedVectorType *V256I32Ty = FixedVectorType::get(B.getInt32Ty(), 256);
				Type *EltTy = V256I32Ty->getElementType();

				// cols.body:
				B.SetInsertPoint(ColBody->getTerminator());
				Value *CurrentRowZExt = B.CreateZExt(CurrentRow, Stride->getType());
				Value *CurrentColZExt = B.CreateZExt(CurrentCol, Stride->getType());
				Value *Offset =
				B.CreateAdd(B.CreateMul(CurrentRowZExt, Stride), CurrentColZExt);
				unsigned AS = cast<PointerType>(Ptr->getType())->getAddressSpace();
				Value *EltBasePtr = B.CreatePointerCast(Ptr, PointerType::get(EltTy, AS));
				Value *EltPtr = B.CreateGEP(EltTy, EltBasePtr, Offset);
				// %mul = mul i16 %row.iv, i16 16
				// %idx = add i16 %mul, i16 %col.iv
				// %vec = extractelement <16 x i32> %vec, i16 %idx
				// store i32 %vec, i32* %ptr
				Value *Idx = B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentCol);
				Value *Elt = B.CreateExtractElement(Vec, Idx);

				B.CreateStore(Elt, EltPtr);
				}
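
The store lowering is the mirror image of the load: each element is extracted from the 256-element vector and written to memory. The sketch below is a hypothetical scalar model (`tileStoreModel` is not a name from the patch) and assumes a non-empty tile, with column count and stride given in bytes as in `lowerTileStore`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the lowered tile store.
void tileStoreModel(std::int32_t *Ptr, std::uint16_t Rows,
                    std::uint16_t ColBytes, std::uint64_t StrideBytes,
                    const std::array<std::int32_t, 256> &Vec) {
  assert(Rows <= 16 && ColBytes <= 64 && "tile is at most 16 rows x 64 bytes");
  const std::size_t Cols = ColBytes / 4;      // Columns in dwords.
  const std::size_t Stride = StrideBytes / 4; // Row stride in dwords.
  for (std::size_t Row = 0; Row < Rows; ++Row)
    for (std::size_t Col = 0; Col < Cols; ++Col)
      // Idx = Row * 16 + Col; Offset = Row * Stride + Col.
      Ptr[Row * Stride + Col] = Vec[Row * 16 + Col];
}
```

Memory between the end of one row and the start of the next (the stride padding) is left untouched.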

				static Value *createTileDPBSSDLoops(BasicBlock *Start, BasicBlock *End,
				IRBuilderBase &B, DomTreeUpdater &DTU,
				LoopInfo &LI, Value *Row, Value *Col,
				Value *K, Value *Acc, Value *LHS,
				Value *RHS) {
				xiangzhangllvm: In fact, there is no need to handle Row, Col and K here; just use the fixed size 16x16. The result of the calculation is the same in the effective area (tileload just needs to keep the "unused" area at 0). Then we can use vectors to handle all of them and let type legalization split the type.
				yubing (author): We should keep the code here. In bf16, since +0.0 (0x0000) times a negative float equals -0.0 (0x8000), following your solution cannot ensure that the outer edge is all zero.
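
The signed-zero point in the reply above can be checked with a small C++ sketch. It is illustrative only: `toBF16Bits` and `zeroTimesBF16Bits` are hypothetical helpers, bf16 is modeled as the upper 16 bits of an IEEE-754 `float` (truncation, which is exact for zero), and the code assumes default floating-point semantics (no fast-math).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: bf16 bit pattern obtained by truncating a float.
std::uint16_t toBF16Bits(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return static_cast<std::uint16_t>(Bits >> 16);
}

// Multiplies +0.0 by X and returns the bf16 bit pattern of the product.
std::uint16_t zeroTimesBF16Bits(float X) {
  volatile float Zero = 0.0f; // volatile keeps the multiply at run time.
  return toBF16Bits(Zero * X);
}
```

With a negative `X` the product carries the sign bit (0x8000), so padding a tile with zeros does not by itself guarantee an all-zero bit pattern after a floating-point multiply.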
				Loop *RowLoop = LI.AllocateLoop();
				Loop *ColLoop = LI.AllocateLoop();
				Loop *InnerLoop = LI.AllocateLoop();
				ColLoop->addChildLoop(InnerLoop);
				RowLoop->addChildLoop(ColLoop);
				if (Loop *ParentL = LI.getLoopFor(Start))
				ParentL->addChildLoop(RowLoop);
				else
				LI.addTopLevelLoop(RowLoop);

				BasicBlock *RowBody =
				pengfei: You can use cast to help to check the failure so that VecA/B/C won't be uninitialized.
				createLoop(Start, End, Row, B.getInt16(1), "tiledpbssd.unroll.rows", B,
				DTU, RowLoop, LI);
				BasicBlock *RowLatch = RowBody->getSingleSuccessor();

				BasicBlock *ColBody =
				createLoop(RowBody, RowLatch, Col, B.getInt16(1),
				pengfei: ditto
				"tiledpbssd.unroll.cols", B, DTU, ColLoop, LI);
				BasicBlock *ColLoopLatch = ColBody->getSingleSuccessor();
				pengfei: Should check it is V256I32?

				pengfei: ditto
				B.SetInsertPoint(ColBody->getTerminator());
				BasicBlock *InnerBody =
				createLoop(ColBody, ColLoopLatch, K, B.getInt16(1),
				"tiledpbssd.unroll.inner", B, DTU, InnerLoop, LI);

				BasicBlock *ColumnLoopHeader = ColBody->getSinglePredecessor();
				BasicBlock *RowLoopHeader = RowBody->getSinglePredecessor();
				BasicBlock *InnerLoopHeader = InnerBody->getSinglePredecessor();
				BasicBlock *InnerLoopLatch = InnerBody->getSingleSuccessor();
				Value *CurrentRow = &*RowLoopHeader->begin();
				Value *CurrentCol = &*ColumnLoopHeader->begin();
				Value *CurrentInner = &*InnerLoopHeader->begin();

				FixedVectorType *V256I32Ty = FixedVectorType::get(B.getInt32Ty(), 256);
				// Type *EltTy = V256I32Ty->getElementType();
				Value *VecC, *VecA, *VecB;
				if (auto BitCast = dyn_cast<BitCastInst>(Acc))
				Lint (pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]; warning: variable 'VecC' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecC = BitCast->getOperand(0);
				LuoYuanke: Delete the dead code.
				assert(VecC->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
				// TODO else create BitCast from x86amx to v256i32.
				// Store x86amx to memory, and reload from memory
				// to vector. However with -O0, it doesn't happen.
				if (auto BitCast = dyn_cast<BitCastInst>(LHS))
				Lint (pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]; warning: variable 'VecA' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecA = BitCast->getOperand(0);
				assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
				if (auto BitCast = dyn_cast<BitCastInst>(RHS))
				Lint (pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]; warning: variable 'VecB' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecB = BitCast->getOperand(0);
				assert(VecB->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");

				// tiledpbssd.unroll.rows.header:
				// %vec.phi.rows = phi <256 x i32> [ %vec_c, %continue ], [ %NewVecC,
				// %tiledpbssd.unroll.rows.latch ]
				B.SetInsertPoint(RowLoopHeader->getTerminator());
				PHINode *VecPhi_Row_Loop = B.CreatePHI(V256I32Ty, 2, "vec.phi.row");
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'VecPhi_Row_Loop' [readability-identifier-naming]
				VecPhi_Row_Loop->addIncoming(VecC, Start);
				LuoYuanke: It should be in another line.

				// tiledpbssd.unroll.cols.header:
				// %vec.phi.cols = phi <256 x i32> [ %vec.phi.rows,
				// %tiledpbssd.unroll.rows.body ], [ %NewVecC, %tiledpbssd.unroll.cols.latch ]
				B.SetInsertPoint(ColumnLoopHeader->getTerminator());
				PHINode *VecPhi_Col_Loop = B.CreatePHI(V256I32Ty, 2, "vec.phi.col");
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'VecPhi_Col_Loop' [readability-identifier-naming]
				VecPhi_Col_Loop->addIncoming(VecPhi_Row_Loop, RowBody);

				// Generate PHI vector for C.
				B.SetInsertPoint(InnerLoopHeader->getTerminator());
				PHINode *VecCPhi = B.CreatePHI(V256I32Ty, 2, "vec.phi");
				VecCPhi->addIncoming(VecPhi_Col_Loop, ColBody);

				LuoYuanke: Better to be in a new line.
				// Generate accmulate multiply in innerbody.
				B.SetInsertPoint(InnerBody->getTerminator());
				Value *IdxC =
				B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentCol);
				Value *IdxA =
				B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentInner);
				Value *IdxB =
				B.CreateAdd(B.CreateMul(CurrentInner, B.getInt16(16)), CurrentCol);
				pengfei: eltc?

				FixedVectorType *V4I8Ty = FixedVectorType::get(B.getInt8Ty(), 4);
				FixedVectorType *V4I32Ty = FixedVectorType::get(B.getInt32Ty(), 4);
				Value *EltC = B.CreateExtractElement(VecCPhi, IdxC);
				LuoYuanke: Better to be in a new line.
				Value *EltA = B.CreateExtractElement(VecA, IdxA);
				Value *SubVecA = B.CreateBitCast(EltA, V4I8Ty);
				Value *EltB = B.CreateExtractElement(VecB, IdxB);
				Value *SubVecB = B.CreateBitCast(EltB, V4I8Ty);
				Value *SubVecR = B.CreateAddReduce(B.CreateMul(
				B.CreateSExt(SubVecA, V4I32Ty), B.CreateSExt(SubVecB, V4I32Ty)));
				Value *ResElt = B.CreateAdd(EltC, SubVecR);
				Value *NewVecC = B.CreateInsertElement(VecCPhi, ResElt, IdxC);
				VecCPhi->addIncoming(NewVecC, InnerLoopLatch);
				VecPhi_Row_Loop->addIncoming(NewVecC, RowLatch);
				VecPhi_Col_Loop->addIncoming(NewVecC, ColLoopLatch);

				return NewVecC;
				}
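
The three nested loops above amount to a signed-byte dot-product accumulation over a 16x16 grid of dwords. The following scalar model is a sketch under stated assumptions, not the patch's code: `Tile`, `signedByte` and `tdpbssdModel` are hypothetical names, the byte order of the `<4 x i8>` bitcast is taken to be little-endian as on x86, the tile is assumed non-empty, and accumulation wraps modulo 2^32 like an IR `add`. N and K are passed in bytes and divided by 4, as `lowerTileDPBSSD` does.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Tile = std::array<std::uint32_t, 256>; // Models <256 x i32>.

// Sign-extends byte number Lane (0 = lowest) of Word, modeling
// sext of one lane of a <4 x i8> bitcast on little-endian x86.
static std::int32_t signedByte(std::uint32_t Word, unsigned Lane) {
  std::int32_t V = static_cast<std::int32_t>((Word >> (8 * Lane)) & 0xFFu);
  return V >= 128 ? V - 256 : V;
}

// Hypothetical scalar model: C[Row][Col] += dot4(A[Row][K], B[K][Col]).
Tile tdpbssdModel(Tile C, const Tile &A, const Tile &B, std::uint16_t M,
                  std::uint16_t NBytes, std::uint16_t KBytes) {
  assert(M <= 16 && NBytes <= 64 && KBytes <= 64);
  const unsigned Cols = NBytes / 4, Inner = KBytes / 4;
  for (unsigned Row = 0; Row < M; ++Row)
    for (unsigned Col = 0; Col < Cols; ++Col)
      for (unsigned K = 0; K < Inner; ++K) {
        std::int32_t Dot = 0; // Add-reduce of four sign-extended products.
        for (unsigned Lane = 0; Lane < 4; ++Lane)
          Dot += signedByte(A[Row * 16 + K], Lane) *
                 signedByte(B[K * 16 + Col], Lane);
        C[Row * 16 + Col] += static_cast<std::uint32_t>(Dot);
      }
  return C;
}
```

The index expressions `Row * 16 + Col`, `Row * 16 + K` and `K * 16 + Col` correspond to `IdxC`, `IdxA` and `IdxB` in the function above.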

				namespace {
				class X86LowerAMXIntrinsics {
				Function &Func;

				pengfei: Is it necessary to insert the ResElt to VecC?
				yubing (author): Yes, it is necessary, since you should use the updated eltC (aka Cij) when doing the matrix dot product: Cij = Cij + Ai1.B1j; Cij = Cij + Ai2.B2j; ...; Cij = Cij + AiK.BKj
				pengfei: But you don't need to update both C and D. Something like this pseudocode should be enough: for (k : K) Dij += Aik * Bkj; Dij += Cij
				yubing (author): I changed the code to the following style, which also reduces the inner loop's size: for (k : K) Cij += Aik * Bkj; Dij = Cij. Besides, I hoisted the calculation of (i,j)'s linear index above the inner loop.
				LuoYuanke: It seems that keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instructions for vector C.
				yubing (author): But your solution still needs to update D, so D's phi will be kept in the inner loop.
				public:
				X86LowerAMXIntrinsics(Function &F, DominatorTree *DT, LoopInfo *LI)
				: Func(F), DT(DT), LI(LI) {}
				bool visit();

				private:
				DominatorTree *DT;
				LoopInfo *LI;
				bool lowerTileLoad(Instruction *TileLoad);
				bool lowerTileDPBSSD(Instruction *TileDPBSSD);
				bool lowerTileStore(Instruction *TileStore);
				bool lowerTileZero(Instruction *TileZero);
				};

				bool X86LowerAMXIntrinsics::lowerTileDPBSSD(Instruction *TileDPBSSD) {
				Value *M, *N, *K, *C, *A, *B;
				match(TileDPBSSD, m_Intrinsic<Intrinsic::x86_tdpbssd_internal>(
				m_Value(M), m_Value(N), m_Value(K), m_Value(C),
				m_Value(A), m_Value(B)));
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				Instruction *InsertI = TileDPBSSD;
				IRBuilder<> Builder_Prepare(TileDPBSSD);
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'Builder_Prepare' [readability-identifier-naming]
				Builder_Prepare.SetInsertPoint(TileDPBSSD);
				// We visit the loop with (m, n/4, k/4):
				// %n_dword = udiv i16 %n, 4
				// %k_dword = udiv i16 %k, 4
				Value *N_DWord = Builder_Prepare.CreateUDiv(N, Builder_Prepare.getInt16(4));
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'N_DWord' [readability-identifier-naming]
				Value *K_DWord = Builder_Prepare.CreateUDiv(K, Builder_Prepare.getInt16(4));
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'K_DWord' [readability-identifier-naming]
				BasicBlock *Start = InsertI->getParent();
				pengfei: TileLoadStore
				BasicBlock *End =
				pengfei: Forgot to remove?
				SplitBlock(InsertI->getParent(), InsertI, DT, LI, nullptr, "continue");
				IRBuilder<> Builder(TileDPBSSD);
				pengfei: ditto
				Value *ResVec = createTileDPBSSDLoops(Start, End, Builder, DTU, *LI, M,
				N_DWord, K_DWord, C, A, B);

				// Delete tileloadd6 intrinsic and bitcast instruction.
				for (auto UI = TileDPBSSD->use_begin(), UE = TileDPBSSD->use_end();
				UI != UE;) {
				Instruction *I = cast<Instruction>((UI++)->getUser());
				Value *Vec;
				if (match(I, m_BitCast(m_Value(Vec)))) {
				I->replaceAllUsesWith(ResVec);
				I->eraseFromParent();
				}
				}
				xiangzhangllvm: I see you need to force-match the bitcast and then replace it; add an assert for the no-bitcast case.
				TileDPBSSD->eraseFromParent();
				return true;
				}

				bool X86LowerAMXIntrinsics::lowerTileLoad(Instruction *TileLoad) {
				Value *M, *N, *Ptr, *Stride;
				match(TileLoad, m_Intrinsic<Intrinsic::x86_tileloadd64_internal>(
				m_Value(M), m_Value(N), m_Value(Ptr), m_Value(Stride)));
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				Instruction *InsertI = TileLoad;
				IRBuilder<> Builder_Prepare(TileLoad);
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'Builder_Prepare' [readability-identifier-naming]
				Builder_Prepare.SetInsertPoint(TileLoad);
				Value *N_DWord = Builder_Prepare.CreateUDiv(N, Builder_Prepare.getInt16(4));
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'N_DWord' [readability-identifier-naming]
				Value *Stride_DWord =
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'Stride_DWord' [readability-identifier-naming]
				Builder_Prepare.CreateUDiv(Stride, Builder_Prepare.getInt64(4));
				BasicBlock *Start = InsertI->getParent();
				LuoYuanke: The name seems not good. Is "PreBuilder" better? And why do we need two builders in the function?
				BasicBlock *End =
				SplitBlock(InsertI->getParent(), InsertI, DT, LI, nullptr, "continue");
				IRBuilder<> Builder(TileLoad);
				Value *ResVec = createTileLoadLoops(Start, End, Builder, DTU, *LI, M, N_DWord,
				Ptr, Stride_DWord);
				LuoYuanke: Maybe use a right shift instruction, which is more efficient. I am not sure the following passes can optimize the operation.

				// Delete tileloadd6 intrinsic and bitcast instruction.
				for (auto UI = TileLoad->use_begin(), UE = TileLoad->use_end(); UI != UE;) {
				Instruction *I = cast<Instruction>((UI++)->getUser());
				Value *Vec;
				if (match(I, m_BitCast(m_Value(Vec)))) {
				I->replaceAllUsesWith(ResVec);
				I->eraseFromParent();
				}
				}
				pengfei: ditto
				TileLoad->eraseFromParent();
				return true;
				}
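
The review suggestion to replace the division by 4 with a right shift relies on the two being equivalent for unsigned values. A minimal sketch of that equivalence follows; the helper names are hypothetical and the code is only an illustration, not a proposed change to the pass.

```cpp
#include <cstdint>

// Byte count to dword count, written as an unsigned division (as in the patch).
constexpr std::uint16_t bytesToDWordsDiv(std::uint16_t N) {
  return static_cast<std::uint16_t>(N / 4);
}

// The same conversion written as a logical right shift by 2.
constexpr std::uint16_t bytesToDWordsShift(std::uint16_t N) {
  return static_cast<std::uint16_t>(N >> 2);
}

static_assert(bytesToDWordsDiv(64) == bytesToDWordsShift(64),
              "udiv by 4 and lshr by 2 agree for unsigned values");
```

In IR terms this corresponds to using `CreateLShr` with a shift amount of 2 in place of `CreateUDiv` with a divisor of 4.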

				pengfei: ditto
				bool X86LowerAMXIntrinsics::lowerTileStore(Instruction *TileStore) {
				Value *M, *N, *Ptr, *Stride, *Tile;
				match(TileStore, m_Intrinsic<Intrinsic::x86_tilestored64_internal>(
				m_Value(M), m_Value(N), m_Value(Ptr), m_Value(Stride),
				m_Value(Tile)));
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				Instruction *InsertI = TileStore;
				IRBuilder<> Builder_Prepare(TileStore);
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'Builder_Prepare' [readability-identifier-naming]
				Builder_Prepare.SetInsertPoint(TileStore);
				Value *N_DWord = Builder_Prepare.CreateUDiv(N, Builder_Prepare.getInt16(4));
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'N_DWord' [readability-identifier-naming]
				Value *Stride_DWord =
				Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'Stride_DWord' [readability-identifier-naming]
				Builder_Prepare.CreateUDiv(Stride, Builder_Prepare.getInt64(4));
				BasicBlock *Start = InsertI->getParent();
				BasicBlock *End =
				SplitBlock(InsertI->getParent(), InsertI, DT, LI, nullptr, "continue");
				IRBuilder<> Builder(TileStore);
				createTileStoreLoops(Start, End, Builder, DTU, *LI, M, N_DWord, Ptr,
				Stride_DWord, Tile);

				TileStore->eraseFromParent();
				LuoYuanke: Is "PreBuilder" better?
				return true;
				}

				bool X86LowerAMXIntrinsics::lowerTileZero(Instruction *TileZero) {
				LuoYuanke: Shift?
				IRBuilder<> Builder(TileZero);
				FixedVectorType *V256I32Ty = FixedVectorType::get(Builder.getInt32Ty(), 256);
				Value *VecZero = Constant::getNullValue(V256I32Ty);
				for (auto UI = TileZero->use_begin(), UE = TileZero->use_end(); UI != UE;) {
				Instruction *I = cast<Instruction>((UI++)->getUser());
				Value *Vec;
				if (match(I, m_BitCast(m_Value(Vec)))) {
				I->replaceAllUsesWith(VecZero);
				I->eraseFromParent();
				}
				}
				TileZero->eraseFromParent();
				return true;
				}

				bool X86LowerAMXIntrinsics::visit() {
				bool C;
				SmallVector<Instruction *, 8> TileDPBSSDs;
				pengfei: `bool C = false`
				SmallVector<Instruction *, 8> TileLoads;
				SmallVector<Instruction *, 8> TileStores;
				SmallVector<Instruction *, 8> TileZeros;

				for (BasicBlock *BB : post_order(&Func)) {
				for (BasicBlock::reverse_iterator II = BB->rbegin(), IE = BB->rend();
				pengfei: We can iterate in forward order. Besides, we cannot assume there will always be a bitcast after, e.g., x86_tileloadd64_internal, so we need to insert one bitcast as required.
				II != IE;) {
				Instruction &Inst = *II++;
				if (match(&Inst, m_Intrinsic<Intrinsic::x86_tdpbssd_internal>())) {
				// %amx1 = bitcast <256 x i32> %vec to x86_amx
				// %res = call x86_amx @llvm.x86.tdpbssd.internal(i16 m, i16 n, i16 k,
				// x86_amx, %amx1, ...)
				// %vec2 = bitcast x86_amx %res to <256 x i32>
				TileDPBSSDs.push_back(&Inst);
				} else if (match(&Inst,
				LuoYuanke: PreBuilder?
				m_Intrinsic<Intrinsic::x86_tileloadd64_internal>())) {
				// %17 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %13, i16 %14,
				// i8* %15, i64 %16)
				// %18 = bitcast x86_amx %17 to <256 x i32>
				TileLoads.push_back(&Inst);
				} else if (match(&Inst,
				m_Intrinsic<Intrinsic::x86_tilestored64_internal>())) {
				// %89 = bitcast <256 x i32> %88 to x86_amx
				// call void @llvm.x86.tilestored64.internal(i16 %84, i16 %85, i8* %86,
				// i64 %87, x86_amx %89)
				TileStores.push_back(&Inst);
				} else if (match(&Inst,
				m_Intrinsic<Intrinsic::x86_tilezero_internal>())) {
				// %89 = bitcast <256 x i32> %88 to x86_amx
				// call void @llvm.x86.tilezero.internal(i16 %84, i16 %85)
				TileZeros.push_back(&Inst);
				}
				}
				}

				for (auto *Inst : TileLoads) {
				C |= lowerTileLoad(Inst);
				Lint: Pre-merge checks: clang-tidy: warning: variable 'C' is uninitialized when used here [clang-diagnostic-uninitialized]
				pengfei: Remove the `{}` for a single-line loop.
				}
				xiangzhangllvm: '|' is bitwise OR; use logical ||.
				for (auto *Inst : TileDPBSSDs) {
				C |= lowerTileDPBSSD(Inst);
				}
				for (auto *Inst : TileStores) {
				C |= lowerTileStore(Inst);
				}
				for (auto *Inst : TileZeros) {
				C |= lowerTileZero(Inst);
				}

				return C;
				}
				} // anonymous namespace
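
The exchange above about `|` versus `||` in the `C |= lowerTile...(Inst)` loops touches a C++ subtlety: `||` short-circuits, so rewriting the accumulation as `C = C || lower(Inst)` would stop calling the lowering function once `C` becomes true. A minimal sketch of the difference, using a hypothetical `Lowering` stand-in that is not part of the patch:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for lowerTileLoad() and friends: it counts how often
// it is called and always reports that it changed something.
struct Lowering {
  int Calls = 0;
  bool lower(int /*Inst*/) {
    ++Calls;
    return true;
  }
};

// Accumulate with |=, as in visit(): every instruction is lowered.
int callsWithBitOr(const std::vector<int> &Insts) {
  Lowering L;
  bool C = false;
  for (int I : Insts)
    C |= L.lower(I);
  assert(C || Insts.empty());
  return L.Calls;
}

// Accumulate with ||: after the first 'true', lower() is never called again.
int callsWithLogicalOr(const std::vector<int> &Insts) {
  Lowering L;
  bool C = false;
  for (int I : Insts)
    C = C || L.lower(I);
  assert(C || Insts.empty());
  return L.Calls;
}
```

With three instructions, the `|=` form calls the stand-in three times and the `||` form only once, which is why `|=` on a `bool` is the usual way to accumulate a "changed" flag when the right-hand side has side effects.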

				namespace {

				LuoYuanke: Do we iterate the instructions in topological order or in post order?
				yubing (author): It should be pre-order, since we need to handle cases without bitcasts, such as amx-low-intrinsics-no-bitcast.ll
				class X86LowerAMXIntrinsicsLegacyPass : public FunctionPass {
				public:
				static char ID;
				pengfei: Should be better to use `if (auto Inst = dyn_cast<IntrinsicInst>&II++) switch (Inst->getIntrinsicID()) { case Intrinsic::x86_tdpbssd_internal: ...`

				X86LowerAMXIntrinsicsLegacyPass() : FunctionPass(ID) {
				initializeX86LowerAMXIntrinsicsLegacyPassPass(
				*PassRegistry::getPassRegistry());
				}

				bool runOnFunction(Function &F) override {
				if (!F.hasFnAttribute(Attribute::OptimizeNone))
				return false;

				auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				pengfei: ditto
				auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();

				X86LowerAMXIntrinsics LAT(F, &DT, &LI);
				bool C = LAT.visit();
				return C;
				pengfei: You can just return it by `return LAT.visit()`.
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.addPreserved<DominatorTreeWrapperPass>();
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addPreserved<LoopInfoWrapperPass>();
				}
				};

				} // anonymous namespace

				static const char PassName[] = "Lower AMX intrinsics";
				char X86LowerAMXIntrinsicsLegacyPass::ID = 0;
				INITIALIZE_PASS_BEGIN(X86LowerAMXIntrinsicsLegacyPass, DEBUG_TYPE, PassName,
				false, false)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_END(X86LowerAMXIntrinsicsLegacyPass, DEBUG_TYPE, PassName,
				false, false)

				FunctionPass *llvm::createX86LowerAMXIntrinsicsPass() {
				return new X86LowerAMXIntrinsicsLegacyPass();
				}

llvm/lib/Target/X86/X86TargetMachine.cpp

[56 lines not shown] static cl::opt<bool> EnableMachineCombinerPass("x86-machine-combiner",
   cl::init(true), cl::Hidden);

 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeX86Target() {
   // Register the target.
   RegisterTargetMachine<X86TargetMachine> X(getTheX86_32Target());
   RegisterTargetMachine<X86TargetMachine> Y(getTheX86_64Target());

   PassRegistry &PR = *PassRegistry::getPassRegistry();
+  initializeX86LowerAMXIntrinsicsLegacyPassPass(PR);
   initializeX86LowerAMXTypeLegacyPassPass(PR);
   initializeGlobalISel(PR);
   initializeWinEHStatePassPass(PR);
   initializeFixupBWInstPassPass(PR);
   initializeEvexToVexInstPassPass(PR);
   initializeFixupLEAPassPass(PR);
   initializeFPSPass(PR);
   initializeX86FixupSetCCPassPass(PR);
[332 lines not shown] INITIALIZE_PASS_END(X86ExecutionDomainFix, "x86-execution-domain-fix",
   "X86 Execution Domain Fix", false, false)

 TargetPassConfig *X86TargetMachine::createPassConfig(PassManagerBase &PM) {
   return new X86PassConfig(*this, PM);
 }

 void X86PassConfig::addIRPasses() {
   addPass(createAtomicExpandPass());

+  if (TM->getOptLevel() == CodeGenOpt::None)
craig.topper: I don't think you can detect O0 this way. A function can have the optnone attribute in the non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline.
LuoYuanke: @craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass to determine whether the AMX intrinsics in the function need to be scalarized?
+    addPass(createX86LowerAMXIntrinsicsPass());
LuoYuanke: We may add both passes anyway and skip each pass based on the opt level and the optnone attribute inside the two passes.
+  else {
     addPass(createX86LowerAMXTypePass());
+  }

   TargetPassConfig::addIRPasses();

   if (TM->getOptLevel() != CodeGenOpt::None) {
     addPass(createInterleavedAccessPass());
     addPass(createX86PartialReductionPass());
   }

[163 lines not shown]
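
The review thread in `addIRPasses()` turns on which condition should trigger scalarization. The revision summary states that the pass is skipped only when the pipeline is not O0 and the function lacks `optnone`. A minimal sketch of that predicate follows; the `OptLevel` enum and the function name are hypothetical and stand in for `CodeGenOpt::Level` and `Function::hasFnAttribute(Attribute::OptimizeNone)`.

```cpp
#include <cassert>

// Hypothetical mirror of the codegen optimization levels.
enum class OptLevel { None, Less, Default, Aggressive };

// Scalarize AMX intrinsics when either the whole pipeline runs at O0 or the
// individual function is marked optnone (for example an O0 translation unit
// linked into an O3 LTO pipeline).
bool shouldScalarizeAMX(OptLevel Level, bool HasOptNone) {
  return Level == OptLevel::None || HasOptNone;
}
```

Checking only the pipeline level would miss the LTO case raised in the review, where `Level` is `Aggressive` but the function still carries `optnone`.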

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-amx-intrinsics %s -S | FileCheck %s

				define dso_local void @test_amx_load_non_O0(i16 signext %row, i16 signext %col, i8* %ptr, i64 %stride, <256 x i32>* %vptr) {
				; CHECK-LABEL: @test_amx_load_non_O0(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[AMX:%.*]] = call x86_amx @llvm.x86.tileloadd64.internal(i16 [[ROW:%.*]], i16 [[COL:%.*]], i8* [[PTR:%.*]], i64 [[STRIDE:%.*]])
				; CHECK-NEXT: [[VEC:%.*]] = bitcast x86_amx [[AMX]] to <256 x i32>
				; CHECK-NEXT: store <256 x i32> [[VEC]], <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %ptr, i64 %stride)
				%vec = bitcast x86_amx %amx to <256 x i32>
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}

				define dso_local void @test_amx_load(i16 signext %row, i16 signext %col, i8* %ptr, i64 %stride, <256 x i32>* %vptr) #0 {
				; CHECK-LABEL: @test_amx_load(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = udiv i16 [[COL:%.*]], 4
				; CHECK-NEXT: [[TMP1:%.*]] = udiv i64 [[STRIDE:%.*]], 4
				; CHECK-NEXT: br label [[TILELOAD_UNROLL_ROWS_HEADER:%.*]]
				; CHECK: tileload.unroll.rows.header:
				; CHECK-NEXT: [[TILELOAD_UNROLL_ROWS_IV:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TILELOAD_UNROLL_ROWS_STEP:%.*]], [[TILELOAD_UNROLL_ROWS_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI_ROW:%.*]] = phi <256 x i32> [ zeroinitializer, [[ENTRY]] ], [ [[TMP11:%.*]], [[TILELOAD_UNROLL_ROWS_LATCH]] ]
				; CHECK-NEXT: br label [[TILELOAD_UNROLL_ROWS_BODY:%.*]]
				; CHECK: tileload.unroll.rows.body:
				; CHECK-NEXT: br label [[TILELOAD_UNROLL_COLS_HEADER:%.*]]
				; CHECK: tileload.unroll.cols.header:
				; CHECK-NEXT: [[TILELOAD_UNROLL_COLS_IV:%.*]] = phi i16 [ 0, [[TILELOAD_UNROLL_ROWS_BODY]] ], [ [[TILELOAD_UNROLL_COLS_STEP:%.*]], [[TILELOAD_UNROLL_COLS_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <256 x i32> [ [[VEC_PHI_ROW]], [[TILELOAD_UNROLL_ROWS_BODY]] ], [ [[TMP11]], [[TILELOAD_UNROLL_COLS_LATCH]] ]
				; CHECK-NEXT: br label [[TILELOAD_UNROLL_COLS_BODY:%.*]]
				; CHECK: tileload.unroll.cols.body:
				; CHECK-NEXT: [[TMP2:%.*]] = zext i16 [[TILELOAD_UNROLL_ROWS_IV]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = zext i16 [[TILELOAD_UNROLL_COLS_IV]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP2]], [[TMP1]]
				; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[TMP4]], [[TMP3]]
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i8* [[PTR:%.*]] to i32*
				; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP6]], i64 [[TMP5]]
				; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4
				; CHECK-NEXT: [[TMP9:%.*]] = mul i16 [[TILELOAD_UNROLL_ROWS_IV]], 16
				; CHECK-NEXT: [[TMP10:%.*]] = add i16 [[TMP9]], [[TILELOAD_UNROLL_COLS_IV]]
				; CHECK-NEXT: [[TMP11]] = insertelement <256 x i32> [[VEC_PHI]], i32 [[TMP8]], i16 [[TMP10]]
				; CHECK-NEXT: br label [[TILELOAD_UNROLL_COLS_LATCH]]
				; CHECK: tileload.unroll.cols.latch:
				; CHECK-NEXT: [[TILELOAD_UNROLL_COLS_STEP]] = add i16 [[TILELOAD_UNROLL_COLS_IV]], 1
				; CHECK-NEXT: [[TILELOAD_UNROLL_COLS_COND:%.*]] = icmp ne i16 [[TILELOAD_UNROLL_COLS_STEP]], [[TMP0]]
				; CHECK-NEXT: br i1 [[TILELOAD_UNROLL_COLS_COND]], label [[TILELOAD_UNROLL_COLS_HEADER]], label [[TILELOAD_UNROLL_ROWS_LATCH]]
				; CHECK: tileload.unroll.rows.latch:
				; CHECK-NEXT: [[TILELOAD_UNROLL_ROWS_STEP]] = add i16 [[TILELOAD_UNROLL_ROWS_IV]], 1
				; CHECK-NEXT: [[TILELOAD_UNROLL_ROWS_COND:%.*]] = icmp ne i16 [[TILELOAD_UNROLL_ROWS_STEP]], [[ROW:%.*]]
				; CHECK-NEXT: br i1 [[TILELOAD_UNROLL_ROWS_COND]], label [[TILELOAD_UNROLL_ROWS_HEADER]], label [[CONTINUE:%.*]]
				; CHECK: continue:
				; CHECK-NEXT: store <256 x i32> [[TMP11]], <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %ptr, i64 %stride)
				%vec = bitcast x86_amx %amx to <256 x i32>
				pengfei: Maybe we can use a zero-masked load in a future optimization.
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}
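
Read as plain C++, the loop nest checked in `@test_amx_load` behaves roughly like the sketch below. It is a hypothetical reference model derived from the CHECK lines only (the function and parameter names are not from the patch), and it uses top-tested `for` loops, whereas the generated IR tests the exit condition in the latch and therefore assumes at least one row and one column.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the lowered tile load. Rows is the tile
// height, ColBytes the row width in bytes, StrideBytes the distance in bytes
// between consecutive rows in memory.
std::array<int32_t, 256> modelTileLoad(uint16_t Rows, uint16_t ColBytes,
                                       const uint8_t *Ptr,
                                       uint64_t StrideBytes) {
  assert(Rows <= 16 && ColBytes <= 64 && "tile is at most 16 rows x 64 bytes");
  std::array<int32_t, 256> Vec{};                // phi starts at zeroinitializer
  const uint16_t ColDWords = ColBytes / 4;       // udiv i16 %col, 4
  const uint64_t StrideDWords = StrideBytes / 4; // udiv i64 %stride, 4
  for (uint16_t R = 0; R < Rows; ++R) {
    for (uint16_t C = 0; C < ColDWords; ++C) {
      // getelementptr i32: index = row * stride_in_dwords + col
      uint64_t Offset = uint64_t(R) * StrideDWords + C;
      int32_t Elt;
      std::memcpy(&Elt, Ptr + Offset * 4, sizeof(Elt)); // load i32
      Vec[R * 16 + C] = Elt; // insertelement at row * 16 + col
    }
  }
  return Vec;
}
```

Because the stride is converted to dwords with an unsigned division, a byte stride that is not a multiple of 4 is rounded down in this model, as it is in the checked IR.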

				define dso_local void @test_amx_dp(i16 signext %row, i16 signext %col, i16 signext %k, <256 x i32> %c, <256 x i32> %a, <256 x i32> %b, <256 x i32>* %vptr) #0 {
				; CHECK-LABEL: @test_amx_dp(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_AMX:%.*]] = bitcast <256 x i32> [[A:%.*]] to x86_amx
				; CHECK-NEXT: [[B_AMX:%.*]] = bitcast <256 x i32> [[B:%.*]] to x86_amx
				; CHECK-NEXT: [[C_AMX:%.*]] = bitcast <256 x i32> [[C:%.*]] to x86_amx
				; CHECK-NEXT: [[TMP0:%.*]] = udiv i16 [[COL:%.*]], 4
				; CHECK-NEXT: [[TMP1:%.*]] = udiv i16 [[K:%.*]], 4
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_ROWS_HEADER:%.*]]
				; CHECK: tiledpbssd.unroll.rows.header:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_ROWS_IV:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TILEDPBSSD_UNROLL_ROWS_STEP:%.*]], [[TILEDPBSSD_UNROLL_ROWS_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI_ROW:%.*]] = phi <256 x i32> [ [[C]], [[ENTRY]] ], [ [[TMP18:%.*]], [[TILEDPBSSD_UNROLL_ROWS_LATCH]] ]
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_ROWS_BODY:%.*]]
				yubing (author): Sorry, there is a bug here. According to AMX's spec, dst's remaining part should be all zero.
				; CHECK: tiledpbssd.unroll.rows.body:
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_COLS_HEADER:%.*]]
				; CHECK: tiledpbssd.unroll.cols.header:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_COLS_IV:%.*]] = phi i16 [ 0, [[TILEDPBSSD_UNROLL_ROWS_BODY]] ], [ [[TILEDPBSSD_UNROLL_COLS_STEP:%.*]], [[TILEDPBSSD_UNROLL_COLS_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI_COL:%.*]] = phi <256 x i32> [ [[VEC_PHI_ROW]], [[TILEDPBSSD_UNROLL_ROWS_BODY]] ], [ [[TMP18]], [[TILEDPBSSD_UNROLL_COLS_LATCH]] ]
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_COLS_BODY:%.*]]
				; CHECK: tiledpbssd.unroll.cols.body:
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_INNER_HEADER:%.*]]
				; CHECK: tiledpbssd.unroll.inner.header:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_INNER_IV:%.*]] = phi i16 [ 0, [[TILEDPBSSD_UNROLL_COLS_BODY]] ], [ [[TILEDPBSSD_UNROLL_INNER_STEP:%.*]], [[TILEDPBSSD_UNROLL_INNER_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <256 x i32> [ [[VEC_PHI_COL]], [[TILEDPBSSD_UNROLL_COLS_BODY]] ], [ [[TMP18]], [[TILEDPBSSD_UNROLL_INNER_LATCH]] ]
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_INNER_BODY:%.*]]
				; CHECK: tiledpbssd.unroll.inner.body:
				; CHECK-NEXT: [[TMP2:%.*]] = mul i16 [[TILEDPBSSD_UNROLL_ROWS_IV]], 16
				; CHECK-NEXT: [[TMP3:%.*]] = add i16 [[TMP2]], [[TILEDPBSSD_UNROLL_COLS_IV]]
				; CHECK-NEXT: [[TMP4:%.*]] = mul i16 [[TILEDPBSSD_UNROLL_ROWS_IV]], 16
				; CHECK-NEXT: [[TMP5:%.*]] = add i16 [[TMP4]], [[TILEDPBSSD_UNROLL_INNER_IV]]
				; CHECK-NEXT: [[TMP6:%.*]] = mul i16 [[TILEDPBSSD_UNROLL_INNER_IV]], 16
				; CHECK-NEXT: [[TMP7:%.*]] = add i16 [[TMP6]], [[TILEDPBSSD_UNROLL_COLS_IV]]
				; CHECK-NEXT: [[TMP8:%.*]] = extractelement <256 x i32> [[VEC_PHI]], i16 [[TMP3]]
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <256 x i32> [[A]], i16 [[TMP5]]
				; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32 [[TMP9]] to <4 x i8>
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <256 x i32> [[B]], i16 [[TMP7]]
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32 [[TMP11]] to <4 x i8>
				; CHECK-NEXT: [[TMP13:%.*]] = sext <4 x i8> [[TMP12]] to <4 x i32>
				; CHECK-NEXT: [[TMP14:%.*]] = sext <4 x i8> [[TMP10]] to <4 x i32>
				; CHECK-NEXT: [[TMP15:%.*]] = mul <4 x i32> [[TMP14]], [[TMP13]]
				; CHECK-NEXT: [[TMP16:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP15]])
				; CHECK-NEXT: [[TMP17:%.*]] = add i32 [[TMP8]], [[TMP16]]
				; CHECK-NEXT: [[TMP18]] = insertelement <256 x i32> [[VEC_PHI]], i32 [[TMP17]], i16 [[TMP3]]
				; CHECK-NEXT: br label [[TILEDPBSSD_UNROLL_INNER_LATCH]]
				; CHECK: tiledpbssd.unroll.inner.latch:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_INNER_STEP]] = add i16 [[TILEDPBSSD_UNROLL_INNER_IV]], 1
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_INNER_COND:%.*]] = icmp ne i16 [[TILEDPBSSD_UNROLL_INNER_STEP]], [[TMP1]]
				; CHECK-NEXT: br i1 [[TILEDPBSSD_UNROLL_INNER_COND]], label [[TILEDPBSSD_UNROLL_INNER_HEADER]], label [[TILEDPBSSD_UNROLL_COLS_LATCH]]
				; CHECK: tiledpbssd.unroll.cols.latch:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_COLS_STEP]] = add i16 [[TILEDPBSSD_UNROLL_COLS_IV]], 1
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_COLS_COND:%.*]] = icmp ne i16 [[TILEDPBSSD_UNROLL_COLS_STEP]], [[TMP0]]
				; CHECK-NEXT: br i1 [[TILEDPBSSD_UNROLL_COLS_COND]], label [[TILEDPBSSD_UNROLL_COLS_HEADER]], label [[TILEDPBSSD_UNROLL_ROWS_LATCH]]
				; CHECK: tiledpbssd.unroll.rows.latch:
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_ROWS_STEP]] = add i16 [[TILEDPBSSD_UNROLL_ROWS_IV]], 1
				; CHECK-NEXT: [[TILEDPBSSD_UNROLL_ROWS_COND:%.*]] = icmp ne i16 [[TILEDPBSSD_UNROLL_ROWS_STEP]], [[ROW:%.*]]
				; CHECK-NEXT: br i1 [[TILEDPBSSD_UNROLL_ROWS_COND]], label [[TILEDPBSSD_UNROLL_ROWS_HEADER]], label [[CONTINUE:%.*]]
				; CHECK: continue:
				; CHECK-NEXT: store <256 x i32> [[TMP18]], <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%a.amx = bitcast <256 x i32> %a to x86_amx
				%b.amx = bitcast <256 x i32> %b to x86_amx
				%c.amx = bitcast <256 x i32> %c to x86_amx
				%acc = call x86_amx @llvm.x86.tdpbssd.internal(i16 %row, i16 %col, i16 %k, x86_amx %c.amx, x86_amx %a.amx, x86_amx %b.amx)
				%vec = bitcast x86_amx %acc to <256 x i32>
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}
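
The three-level loop nest checked in `@test_amx_dp` is a signed-byte dot-product accumulation. The sketch below is a hypothetical C++ reference model of what those CHECK lines compute; the names are illustrative and not from the patch. It mirrors the output as posted in this diff, so the accumulator starts from the full `%c` vector and does not include the zeroing of the destination's remaining part that the author's inline comment mentions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Tile = std::array<int32_t, 256>;

// Dot product of the four signed bytes packed into A and B
// (bitcast i32 to <4 x i8>, sext to <4 x i32>, mul, vector.reduce.add).
static int32_t dot4Signed(int32_t A, int32_t B) {
  int32_t Sum = 0;
  for (int I = 0; I < 4; ++I) {
    int8_t AByte = static_cast<int8_t>(static_cast<uint32_t>(A) >> (8 * I));
    int8_t BByte = static_cast<int8_t>(static_cast<uint32_t>(B) >> (8 * I));
    Sum += int32_t(AByte) * int32_t(BByte);
  }
  return Sum;
}

// Hypothetical scalar model of the lowered tdpbssd. ColBytes and KBytes are
// byte widths; each is divided by 4 to get a count of i32 elements.
Tile modelTileDPBSSD(uint16_t Rows, uint16_t ColBytes, uint16_t KBytes, Tile C,
                     const Tile &A, const Tile &B) {
  assert(Rows <= 16 && ColBytes <= 64 && KBytes <= 64);
  const uint16_t ColDWords = ColBytes / 4; // udiv i16 %col, 4
  const uint16_t KDWords = KBytes / 4;     // udiv i16 %k, 4
  for (uint16_t R = 0; R < Rows; ++R)
    for (uint16_t Col = 0; Col < ColDWords; ++Col)
      for (uint16_t K = 0; K < KDWords; ++K) {
        int32_t Dot = dot4Signed(A[R * 16 + K], B[K * 16 + Col]);
        // Wrap-around add, like 'add i32' in the IR.
        uint32_t Acc = uint32_t(C[R * 16 + Col]) + uint32_t(Dot);
        C[R * 16 + Col] = static_cast<int32_t>(Acc);
      }
  return C;
}
```

All three tiles use a fixed row pitch of 16 elements, which is why every index in the CHECK lines is formed as `iv * 16 + other_iv`.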

				define dso_local void @test_amx_store(i16 signext %row, i16 signext %col, i8* %ptr, i64 %stride, <256 x i32>* %vptr, <256 x i32> %vec) #0 {
				; CHECK-LABEL: @test_amx_store(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[AMX:%.*]] = bitcast <256 x i32> [[VEC:%.*]] to x86_amx
				; CHECK-NEXT: [[TMP0:%.*]] = udiv i16 [[COL:%.*]], 4
				; CHECK-NEXT: [[TMP1:%.*]] = udiv i64 [[STRIDE:%.*]], 4
				; CHECK-NEXT: br label [[TILESTORE_UNROLL_ROWS_HEADER:%.*]]
				; CHECK: tilestore.unroll.rows.header:
				; CHECK-NEXT: [[TILESTORE_UNROLL_ROWS_IV:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TILESTORE_UNROLL_ROWS_STEP:%.*]], [[TILESTORE_UNROLL_ROWS_LATCH:%.*]] ]
				; CHECK-NEXT: br label [[TILESTORE_UNROLL_ROWS_BODY:%.*]]
				; CHECK: tilestore.unroll.rows.body:
				; CHECK-NEXT: br label [[TILESTORE_UNROLL_COLS_HEADER:%.*]]
				; CHECK: tilestore.unroll.cols.header:
				; CHECK-NEXT: [[TILESTORE_UNROLL_COLS_IV:%.*]] = phi i16 [ 0, [[TILESTORE_UNROLL_ROWS_BODY]] ], [ [[TILESTORE_UNROLL_COLS_STEP:%.*]], [[TILESTORE_UNROLL_COLS_LATCH:%.*]] ]
				; CHECK-NEXT: br label [[TILESTORE_UNROLL_COLS_BODY:%.*]]
				; CHECK: tilestore.unroll.cols.body:
				; CHECK-NEXT: [[TMP2:%.*]] = zext i16 [[TILESTORE_UNROLL_ROWS_IV]] to i64
				; CHECK-NEXT: [[TMP3:%.*]] = zext i16 [[TILESTORE_UNROLL_COLS_IV]] to i64
				; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP2]], [[TMP1]]
				; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[TMP4]], [[TMP3]]
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i8* [[PTR:%.*]] to i32*
				; CHECK-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP6]], i64 [[TMP5]]
				; CHECK-NEXT: [[TMP8:%.*]] = mul i16 [[TILESTORE_UNROLL_ROWS_IV]], 16
				; CHECK-NEXT: [[TMP9:%.*]] = add i16 [[TMP8]], [[TILESTORE_UNROLL_COLS_IV]]
				; CHECK-NEXT: [[TMP10:%.*]] = extractelement <256 x i32> [[VEC]], i16 [[TMP9]]
				; CHECK-NEXT: store i32 [[TMP10]], i32* [[TMP7]], align 4
				; CHECK-NEXT: br label [[TILESTORE_UNROLL_COLS_LATCH]]
				; CHECK: tilestore.unroll.cols.latch:
				; CHECK-NEXT: [[TILESTORE_UNROLL_COLS_STEP]] = add i16 [[TILESTORE_UNROLL_COLS_IV]], 1
				; CHECK-NEXT: [[TILESTORE_UNROLL_COLS_COND:%.*]] = icmp ne i16 [[TILESTORE_UNROLL_COLS_STEP]], [[TMP0]]
				; CHECK-NEXT: br i1 [[TILESTORE_UNROLL_COLS_COND]], label [[TILESTORE_UNROLL_COLS_HEADER]], label [[TILESTORE_UNROLL_ROWS_LATCH]]
				; CHECK: tilestore.unroll.rows.latch:
				; CHECK-NEXT: [[TILESTORE_UNROLL_ROWS_STEP]] = add i16 [[TILESTORE_UNROLL_ROWS_IV]], 1
				; CHECK-NEXT: [[TILESTORE_UNROLL_ROWS_COND:%.*]] = icmp ne i16 [[TILESTORE_UNROLL_ROWS_STEP]], [[ROW:%.*]]
				; CHECK-NEXT: br i1 [[TILESTORE_UNROLL_ROWS_COND]], label [[TILESTORE_UNROLL_ROWS_HEADER]], label [[CONTINUE:%.*]]
				; CHECK: continue:
				; CHECK-NEXT: ret void
				;
				entry:
				%amx = bitcast <256 x i32> %vec to x86_amx
				call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %ptr, i64 %stride, x86_amx %amx)
				ret void
				}
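
The store lowering checked in `@test_amx_store` is the mirror image of the load: each element is extracted from the 256-element vector at `row * 16 + col` and written to memory at `row * (stride / 4) + col`. A hypothetical C++ model of that behaviour, with names that are not from the patch and with top-tested loops in place of the latch-tested IR loops:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the lowered tile store. Only the first
// ColBytes / 4 dwords of each row are written; bytes between the end of a
// row and the next row's start are left untouched.
void modelTileStore(uint16_t Rows, uint16_t ColBytes, uint8_t *Ptr,
                    uint64_t StrideBytes,
                    const std::array<int32_t, 256> &Vec) {
  assert(Rows <= 16 && ColBytes <= 64 && "tile is at most 16 rows x 64 bytes");
  const uint16_t ColDWords = ColBytes / 4;       // udiv i16 %col, 4
  const uint64_t StrideDWords = StrideBytes / 4; // udiv i64 %stride, 4
  for (uint16_t R = 0; R < Rows; ++R) {
    for (uint16_t C = 0; C < ColDWords; ++C) {
      int32_t Elt = Vec[R * 16 + C]; // extractelement at row * 16 + col
      uint64_t Offset = uint64_t(R) * StrideDWords + C;
      std::memcpy(Ptr + Offset * 4, &Elt, sizeof(Elt)); // store i32
    }
  }
}
```

Storing a 2-row, 8-byte-wide tile with a 12-byte stride writes dwords 0, 1, 3 and 4 of the destination and leaves dwords 2 and 5 as they were.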

				define dso_local void @test_amx_zero(i16 signext %row, i16 signext %col, <256 x i32>* %vptr) #0 {
				; CHECK-LABEL: @test_amx_zero(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: store <256 x i32> zeroinitializer, <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%amx = call x86_amx @llvm.x86.tilezero.internal(i16 %row, i16 %col)
				%vec = bitcast x86_amx %amx to <256 x i32>
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}

				declare x86_amx @llvm.x86.tilezero.internal(i16, i16)
				declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)
				declare x86_amx @llvm.x86.tdpbssd.internal(i16, i16, i16, x86_amx, x86_amx, x86_amx)
				declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)

				attributes #0 = { noinline nounwind optnone }

llvm/test/CodeGen/X86/O0-pipeline.ll

[12 lines not shown]
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Machine Branch Probability Analysis
 ; CHECK-NEXT: ModulePass Manager
 ; CHECK-NEXT: Pre-ISel Intrinsic Lowering
 ; CHECK-NEXT: FunctionPass Manager
 ; CHECK-NEXT: Expand Atomic instructions
-; CHECK-NEXT: Lower AMX type for load/store
+; CHECK-NEXT: Dominator Tree Construction
+; CHECK-NEXT: Natural Loop Information
+; CHECK-NEXT: Lower AMX intrinsics
 ; CHECK-NEXT: Module Verifier
 ; CHECK-NEXT: Lower Garbage Collection Instructions
 ; CHECK-NEXT: Shadow Stack GC Lowering
 ; CHECK-NEXT: Lower constant intrinsics
 ; CHECK-NEXT: Remove unreachable blocks from the CFG
 ; CHECK-NEXT: Instrument function entry/exit with calls to e.g. mcount() (post inlining)
 ; CHECK-NEXT: Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT: Expand reduction intrinsics
[51 lines not shown]