This pass is always added to the pipeline, but it is skipped unless the optimization level is O0 or the function has the optnone attribute. With -O0, the definitions of the shapes for AMX intrinsics sit right next to the AMX intrinsic code, so we cannot find a point that post-dominates all the shape definitions while dominating all the AMX intrinsics. To break the dependency on the shapes, we transform the AMX intrinsics into scalar operations so that compilation does not fail. In the long term, we should improve fast register allocation so it can allocate AMX registers.
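To make "transform AMX intrinsics into scalar operations" concrete, here is a rough standalone model of what a scalarized tile load boils down to. The function name, the element-wise stride, and the explicit zero-fill are illustrative and not lifted from the patch (the zero-fill corresponds to the spec point raised later about dst's unused area).

```cpp
#include <cstdint>
#include <cstring>

// Illustrative scalar model (not the pass's actual IR output) of lowering a
// tile load: copy Row x Col elements from a strided buffer into a fixed
// 16x16 i32 tile, zeroing everything outside the Row x Col effective area.
void scalarTileLoad(int32_t Tile[16][16], const int32_t *Base,
                    unsigned Stride, unsigned Row, unsigned Col) {
  std::memset(Tile, 0, 16 * 16 * sizeof(int32_t)); // keep the unused area 0
  for (unsigned I = 0; I < Row; ++I)               // outer loop over rows
    for (unsigned J = 0; J < Col; ++J)             // inner loop over columns
      Tile[I][J] = Base[I * Stride + J];
}
```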
Event Timeline
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 420 | I don't think you can detect O0 this way. A function can have the optnone attribute in a non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline. |
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 420 | @craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass and deciding there whether the AMX intrinsics in the function need to be scalarized? |
Strange. llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll passes on my local machine.
| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll | |
|---|---|
| 78 | Sorry, there is a bug here. According to the AMX spec, the remaining part of dst should be all zeros. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 434 | bool C = false |
| 440 | We can use a forward order to iterate it. |
| 471 | Remove the {} for a single-line loop. |
| 507 | You can just return LAT.visit() directly. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll | |
|---|---|
| 61 | Maybe we can use a zero-mask load in a future optimization. |
| llvm/lib/Target/X86/X86TargetMachine.cpp | |
|---|---|
| 421 | We may add both passes anyway and have each pass skip itself based on the optimization level and the optnone attribute. |
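A minimal sketch of the gating being discussed, assuming the pass has access to the TargetMachine; the helper name and its placement are illustrative, not part of the patch.

```cpp
#include "llvm/IR/Function.h"
#include "llvm/Target/TargetMachine.h"
using namespace llvm;

// Sketch: scalarize only when nothing later in the pipeline will lower the
// AMX intrinsics, i.e. at -O0 or when this particular function is optnone.
static bool shouldScalarizeAMX(const Function &F, const TargetMachine &TM) {
  return TM.getOptLevel() == CodeGenOpt::None ||
         F.hasFnAttribute(Attribute::OptimizeNone);
}
```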
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 212–213 | In fact, there is no need to handle Row, Col, and K here; just use a fixed 16x16 size. The result of the calculation is the same within the effective area (we just need tileload to keep the "unused" area at 0). |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 212–213 | We should keep the code here. For bf16, since +0.0 (0x0000) * a negative float equals -0.0 (0x8000), your approach cannot guarantee that the outer edge stays all-zero. |
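The signed-zero point can be checked with a tiny standalone program (plain float rather than bf16, but the sign rule is the same):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  float Z = +0.0f * -3.5f;  // IEEE 754: +0.0 times a negative value is -0.0
  std::printf("signbit(Z) = %d\n", std::signbit(Z)); // prints 1
  // So a tile padded with +0.0 does not stay bit-for-bit zero after the
  // multiply, which is why the Row/Col bounds are kept.
  return 0;
}
```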
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | Add comments to describe what the pass does? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 2 | This seems to be the wrong file name. |
| 12 | Typo: 'able'. |
| 160 | Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops so that it can be shared by both, and by some other functions. |
| 251 | Delete the dead code. |
| 268 | It should be on another line. |
| 281 | Better to put it on a new line. |
| 293 | Better to put it on a new line. |
| 373 | The name doesn't seem good. Is "PreBuilder" better? And why do we need two builders in this function? |
| 378 | Maybe use a right-shift instruction, which is more efficient. I don't think the following passes can optimize the operation. |
| 412 | Is "PreBuilder" better? |
| 416 | Shift? |
| 449 | PreBuilder? |
| 488 | Do we iterate the instructions in topological order or in post order? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 488 | It should be pre-order, since we need to handle cases without bitcasts, such as amx-low-intrinsics-no-bitcast.ll. |
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | transforms |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 2 | We usually write the header comment as ===--- filename - description ---=== |
| 52 | Ctx |
| 89 | Can we just use template <bool IsLoad>? I think it can also reduce the branching. |
| 100 | Not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM intrinsics llvm.masked.load/store to reduce the inner loop. |
| 167 | Maybe we can just use cast<> to raise the assertion for us. |
| 224 | You can use cast<> to check for failure so that VecA/B/C won't be left uninitialized. |
| 230 | ditto |
| 232 | Should we check that it is V256I32? |
| 233 | ditto |
| 289 | EltC? |
| 312 | Is it necessary to insert ResElt into VecC? |
| 341 | TileLoadStore |
| 342 | Forgot to remove? |
| 344 | ditto |
| 388 | ditto |
| 392 | ditto |

| llvm/lib/Target/X86/X86LowerAMXType.cpp | |
|---|---|
| 337 | ditto |
| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll | |
|---|---|
| 1 ↗ (On Diff #325964) | Better to name it amx-low-intrinsics-no-amx-bitcast.ll |
| 13 ↗ (On Diff #325964) | It seems the body block is not necessary. |
| 19 ↗ (On Diff #325964) | ditto. The label TILELOAD_SCALARIZE_COLS_BODY is not even used. |
| 31 ↗ (On Diff #325964) | I think cols.latch is not necessary either. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | Yes, it is necessary, since you should use the updated EltC (i.e. Cij) when you are doing the matrix dot product. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | But you don't need to update both C and D. Pseudocode like the following should be enough: for (k : K) Dij += Aik * Bkj; Dij += Cij |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | Why do we need a template instead of passing a parameter bool IsLoad? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | Bing thought template instantiation could keep the condition from turning into branch instructions. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 89 | It's arguable which benefits more: saving code size or avoiding branch instructions. :) |
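A standalone illustration of that trade-off, independent of the patch: the template parameter turns the load/store direction into a compile-time constant, so each instantiation carries no branch, while the runtime-bool version keeps one copy of the code but keeps the branch (which nothing will clean up at -O0).

```cpp
#include <cstdint>

// Compile-time flag: each instantiation contains only one side of the
// check, so no branch is emitted, but the helper exists twice in the binary.
template <bool IsTileLoad>
void copyElement(int32_t *Memory, int32_t *Tile, unsigned Idx) {
  if (IsTileLoad)
    Tile[Idx] = Memory[Idx]; // tileload direction
  else
    Memory[Idx] = Tile[Idx]; // tilestore direction
}

// Runtime flag: a single copy of the code, but the branch stays in the
// emitted code unless an optimizer removes it (it won't at -O0).
void copyElement(int32_t *Memory, int32_t *Tile, unsigned Idx,
                 bool IsTileLoad) {
  if (IsTileLoad)
    Tile[Idx] = Memory[Idx];
  else
    Memory[Idx] = Tile[Idx];
}
```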
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 100 | I think we can put together a follow-up patch for this optimization. |
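For context on the masked load/store idea, here is a scalar model (plain C++, not the LLVM intrinsic itself) of what a masked load provides; the lane count and element type are assumed to match the 16 x i32 rows this pass works with.

```cpp
#include <cstdint>

// Scalar semantics of a masked vector load over 16 x i32 lanes: lanes whose
// mask bit is clear take the pass-through value instead of reading memory.
// With a mask of Col leading ones, one such load covers a whole tile row,
// which is what would let the inner column loop disappear.
void maskedLoad16xi32(int32_t Dst[16], const int32_t *Src,
                      const bool Mask[16], const int32_t PassThru[16]) {
  for (unsigned Lane = 0; Lane < 16; ++Lane)
    Dst[Lane] = Mask[Lane] ? Src[Lane] : PassThru[Lane];
}
```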
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | I changed the code into the following style, which also reduces the inner loop's size: for (k : K) Cij += Aik * Bkj; Dij = Cij. Besides, I hoisted the computation of (i, j)'s linear index above the inner loops. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll | |
|---|---|
| 13 ↗ (On Diff #325964) | In fact, the ISel pass can merge basic blocks together. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract, and insert instructions for vector C. |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 312 | But your solution still needs to update D, so D's phi will be kept in the inner loop. |
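To make the accumulation orders in this thread concrete, here is a standalone scalar sketch of the variant the author describes (a local accumulator seeded from Cij, a single store into Dij). It is simplified to i32 x i32 products rather than the real tdpbssd byte-pair dot product, and all names are illustrative rather than taken from the patch.

```cpp
#include <cstdint>

// Scalar model of the discussed scheme: vector C is only read, vector D is
// written once per (i, j), and the per-element index work is hoisted out of
// the K loop. Simplified element types; not the pass's actual IR.
void tileDotProduct(int32_t D[16][16], const int32_t C[16][16],
                    const int32_t A[16][16], const int32_t B[16][16],
                    unsigned Row, unsigned Col, unsigned K) {
  for (unsigned I = 0; I < Row; ++I)
    for (unsigned J = 0; J < Col; ++J) {
      int32_t Acc = C[I][J];            // read Cij once, outside the K loop
      for (unsigned Ki = 0; Ki < K; ++Ki)
        Acc += A[I][Ki] * B[Ki][J];     // accumulate the dot product
      D[I][J] = Acc;                    // single update of Dij
    }
}
```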
LGTM with some nitpicks 😊
| llvm/include/llvm/CodeGen/Passes.h | |
|---|---|
| 504 | transforms |

| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 10 | operations |
| 11 | We always enable it. Also need to mention optnone. |
| 47 | The curly brackets are not necessary here. |
| 83 | Do we need to remove the successor? Isn't it still being dominated? |
| 118 | Use 1 directly? |
| 122 | Better to use the same naming convention, i.e. ColLoopHeader. |
| 126–127 | Better to change the order, e.g. Type *EltTy = B.getInt32Ty(); FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256); |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll | |
|---|---|
| 2 | I think we should move these files to llvm/test/Transforms/. |

| llvm/test/CodeGen/X86/AMX/amx-type.ll | |
|---|---|
| 2 | Why add this? Is it O2 by default? |
| llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp | |
|---|---|
| 83 | I think this is to remove the edge from the preheader to tmp, because we insert a loop between them. |

| llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll | |
|---|---|
| 2 | Not sure about that. Our .cpp code is under the lib/Target/X86/ folder. |

| llvm/test/CodeGen/X86/AMX/amx-type.ll | |
|---|---|
| 2 | I think this is to test that at opt level 2 this pass does nothing. |
This seems to break the build: https://buildkite.com/mlir/mlir-core/builds/12026#91ec4dfe-542f-4312-92db-7d555f05ce06.
I could reproduce it locally, and reverting it locally fixes the build.
Please address, thanks!
@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867
Hi @RKSimon @nicolasvasilache, it seems we haven't told libLLVMX86CodeGen.so.13git to link against TransformUtils in llvm/lib/Target/X86/CMakeLists.txt. That's why we hit the build failure.
But there is a strange thing that can be observed in build.ninja:
When I run cmake with "-DBUILD_SHARED_LIBS=OFF", libLLVMX86CodeGen.a still links lib/libLLVMTransformUtils.a.
When I run cmake with "-DBUILD_SHARED_LIBS=ON", libLLVMX86CodeGen.so.13git does not link TransformUtils.
Is there any difference in the build system between static libraries and shared libraries?
It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions
The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).
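A rough sketch of what that could look like; the function name, the exact intrinsic list, and the elided lowering call are assumptions for illustration, not the actual follow-up patch.

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsX86.h"
using namespace llvm;

static bool lowerAMXIntrinsics(Function &F) {
  // Cheap scan first: bail out before building any analyses, so functions
  // without AMX intrinsics (the common case at -O0) pay almost nothing.
  bool HasAMX = false;
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *II = dyn_cast<IntrinsicInst>(&I))
        switch (II->getIntrinsicID()) {
        case Intrinsic::x86_tileloadd64_internal:
        case Intrinsic::x86_tilestored64_internal:
        case Intrinsic::x86_tdpbssd_internal:
          HasAMX = true;
          break;
        default:
          break;
        }
  if (!HasAMX)
    return false;
  DominatorTree DT(F); // built locally instead of via getAnalysis<...>()
  LoopInfo LI(DT);     // only when lowering will actually happen
  // ... run the existing X86LowerAMXIntrinsics logic with DT and LI ...
  return true;
}
```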
Thanks @nikic, I will fix it ASAP. Besides, how can I reproduce the regression?
I'm asking because I'd like to verify that the regression can no longer be reproduced with my upcoming bugfix.
@yubing In this case I would recommend building sqlite3.c from the test-suite under perf stat and looking at the instructions metric. For me the command looks like this:
perf stat CLANG_BINARY -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -I. -MD -MT MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -MF MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o.d -o MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -c ../MultiSource/Applications/sqlite3/sqlite3.c
You can generally get a build command using ninja -v sqlite3 in test-suite.