This is an archive of the discontinued LLVM Phabricator instance.

Differential D93594

[X86] Pass to transform amx intrinsics to scalar operation.
ClosedPublic

Authored by yubing on Dec 20 2020, 4:58 AM.

Details

Reviewers

LuoYuanke
pengfei
xiangzhangllvm
craig.topper

Commits

rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation.
rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation.

Summary

This pass always runs, but it does nothing unless we compile at -O0 or the
function has the optnone attribute. At -O0, the defs of the shapes passed to
amx intrinsics are near the amx intrinsics themselves, so we are not able to
find a point which post-dominates all the shape defs and dominates all amx
intrinsics. To decouple the dependency on the shape, we transform amx
intrinsics to scalar operations, so that compiling doesn't fail. In the long
term, we should improve fast register allocation to allocate amx registers.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	80 ms	x64 windows > LLVM.CodeGen/XCore::threads.ll

Event Timeline

LuoYuanke created this revision. Dec 20 2020, 4:58 AM

Herald added subscribers: nikic, pengfei, hiraditya, mgorny. Dec 20 2020, 4:58 AM

LuoYuanke requested review of this revision. Dec 20 2020, 4:58 AM

Herald added a project: Restricted Project. Dec 20 2020, 4:58 AM

Herald added a subscriber: llvm-commits.

LuoYuanke added a parent revision: D91927: [X86] Add x86_amx type for intel AMX. Dec 20 2020, 5:00 AM

LuoYuanke added subscribers: annita.zhang, LiuChen3.

Harbormaster completed remote builds in B83067: Diff 312970. Dec 20 2020, 5:49 AM

craig.topper added a subscriber: craig.topper. Dec 20 2020, 11:54 AM

craig.topper added inline comments.

llvm/lib/Target/X86/X86TargetMachine.cpp
415	I don't think you can detect O0 this way. A function can have the optnone attribute in the non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline.

LuoYuanke added inline comments. Dec 21 2020, 4:05 AM

llvm/lib/Target/X86/X86TargetMachine.cpp
415	@craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass to determine whether the amx intrinsics in the function need to be scalarized?

Address Craig's comments and fix clang format issue.

Add test for functions without the optnone attribute.

Harbormaster completed remote builds in B83139: Diff 313085. Dec 21 2020, 5:51 AM

Harbormaster completed remote builds in B83138: Diff 313084. Dec 21 2020, 6:14 AM

Scalarize tilestore.

Harbormaster completed remote builds in B83364: Diff 313492. Dec 23 2020, 12:45 AM

Support tile_zero and fix bugs for tile_load and tile_store.

LuoYuanke added a subscriber: yubing. Jan 12 2021, 10:01 PM

Harbormaster completed remote builds in B84973: Diff 316320. Jan 12 2021, 10:31 PM

yubing commandeered this revision. Jan 28 2021, 12:10 AM

yubing added a reviewer: LuoYuanke.

Fix some bugs in lowerTileDPBSSD, lowerTileStore, lowerTileLoad

Herald added a project: Restricted Project. Jan 28 2021, 2:15 AM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B86986: Diff 319797. Jan 28 2021, 3:40 AM

yubing added a reviewer: pengfei. Feb 4 2021, 11:57 PM

LuoYuanke added a reviewer: xiangzhangllvm. Feb 5 2021, 12:00 AM

LuoYuanke added a reviewer: craig.topper. Feb 5 2021, 12:02 AM

Would you rebase to see if the lit test failure is related to this patch?

Rebase and fix the bug in amx_api.c

yubing mentioned this in D96110: [X86] Pass to transform tdpbf16ps intrinsics to scalar operation. Feb 5 2021, 1:32 AM

Harbormaster completed remote builds in B88039: Diff 321673. Feb 5 2021, 1:42 AM

Strange. llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll passes on my local machine.

yubing added inline comments. Feb 8 2021, 7:53 PM

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
78	Sorry, there is a bug here. According to AMX's spec, dst's remaining part should be all zero.
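
The zero-fill requirement mentioned here can be illustrated with a scalar reference for a tile load. This is a minimal sketch and not code from this patch: `Tile`, `scalarTileLoad`, and its parameters are hypothetical names, and it assumes a 16x16 tile of 32-bit elements, a row count of at most 16, and a column width given in bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 16x16 tile of i32 elements (<256 x i32>).
using Tile = std::array<int32_t, 256>;

// Scalar reference for a tile load: copy `rows` rows of `colsb` bytes each,
// reading row r from base + r * stride, and leave every other element zero.
Tile scalarTileLoad(const uint8_t *base, std::size_t stride, unsigned rows,
                    unsigned colsb) {
  assert(rows <= 16 && colsb <= 64 && colsb % 4 == 0);
  Tile tile{}; // value-initialized, so the unused area starts out as zero
  for (unsigned r = 0; r < rows; ++r) {
    for (unsigned c = 0; c < colsb / 4; ++c) {
      int32_t elt;
      std::memcpy(&elt, base + r * stride + c * 4u, sizeof(elt));
      tile[r * 16 + c] = elt;
    }
  }
  return tile;
}
```

Starting from a zeroed vector rather than an undefined one is what keeps the elements outside the `rows` x `colsb` area at zero.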

xiangzhangllvm added inline comments. Feb 9 2021, 1:12 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
357	I see you need to force-match the bitcast and then replace it; add an assert for the no-bitcast case.
472	'|' is bitwise or; use logical '||'.

pengfei added inline comments. Feb 9 2021, 1:45 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
434	`bool C = false`
440	We can use a forward order to iterate it. Besides, we cannot assume there will always be a bitcast after e.g. x86_tileloadd64_internal, so we need to insert one bitcast as required.
471	Remove the `{}` for single line loop.
507	You can just return it by `return LAT.visit()`.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
61	Maybe we can use zero mask load in future optimization.

LuoYuanke added inline comments. Feb 9 2021, 4:24 AM

llvm/lib/Target/X86/X86TargetMachine.cpp
416	We may add both passes anyway and skip each pass based on the optimization level and the optnone attribute inside the two passes.

xiangzhangllvm added inline comments. Feb 9 2021, 4:45 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213	In fact, there is no need to handle Row, Col, K here; just use a fixed size of 16x16. The result of the calculation is the same in the effective area (tileload just needs to "keep" the "unused" area at 0). Then we can use vectors to handle all of them and let type legalization split the type.

yubing added inline comments. Feb 19 2021, 9:46 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213	We should keep the code here. In bf16, since +0.0 (0x0000) * a negative float is equal to -0.0 (0x8000), following your solution cannot ensure that the outer edge is all zero.
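
The signed-zero effect described in this comment can be checked with ordinary floats. The following is a small sketch, not code from the patch: `upperHalfBits` and `paddedProduct` are hypothetical helpers, and it assumes IEEE-754 binary32 `float` compiled without fast-math options. Since bf16 keeps the upper 16 bits of a binary32 value, the upper half of the product shows the bf16 encoding of the resulting zero.

```cpp
#include <cstdint>
#include <cstring>

// Return the upper 16 bits of an IEEE-754 binary32 value. For +0.0 and -0.0
// these bits are exactly the bf16 encodings 0x0000 and 0x8000.
uint16_t upperHalfBits(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Multiply a zero padding element by an operand, as a fixed-size 16x16
// computation would do for elements outside the configured shape.
float paddedProduct(float padding, float operand) { return padding * operand; }
```

For example, `upperHalfBits(paddedProduct(0.0f, -2.0f))` yields `0x8000`, while a positive operand yields `0x0000`.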

Address the comments above.

Small fix for some code

yubing marked an inline comment as done. Feb 19 2021, 10:01 PM

Harbormaster completed remote builds in B90020: Diff 325152. Feb 19 2021, 10:49 PM

Harbormaster completed remote builds in B90023: Diff 325155. Feb 19 2021, 11:20 PM

LuoYuanke added inline comments. Feb 20 2021, 12:57 AM

llvm/include/llvm/CodeGen/Passes.h
493	Add comments to describe what the pass does?
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2	This seems wrong file name.
12	Typo: 'able'.
160	Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops, so that it can be used by both and some other functions.
251	Delete the dead code.
268	It should be in another line.
281	Better to be in a new line.
293	Better to be in a new line.
373	The name doesn't seem good. Is "PreBuilder" better? And why do we need two builders in the function?
378	Maybe use a right-shift instruction, which is more efficient. I'm not sure a following pass can optimize the operation.
412	Is "PreBuilder" better?
416	Shift?
449	PreBuilder?
488	Do we iterate the instructions in topology order or in post order?

pengfei added inline comments. Feb 21 2021, 7:22 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
491	It would be better to use if (auto *Inst = dyn_cast<IntrinsicInst>(&*II++)) switch (Inst->getIntrinsicID()) { case Intrinsic::x86_tdpbssd_internal: ...
502	ditto

Address comments above

yubing marked 13 inline comments as done. Feb 22 2021, 9:54 PM

yubing added inline comments.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
488	It should be pre-order since we need to handle cases without bitcasts, such as, amx-low-intrinsics-no-bitcast.ll

Harbormaster completed remote builds in B90331: Diff 325670. Feb 22 2021, 10:47 PM

Fix some comments and commit message

yubing marked an inline comment as done. Feb 23 2021, 7:19 PM

yubing edited the summary of this revision. Feb 23 2021, 7:22 PM

Harbormaster completed remote builds in B90524: Diff 325964. Feb 23 2021, 9:59 PM

pengfei added inline comments. Feb 24 2021, 5:03 AM

llvm/include/llvm/CodeGen/Passes.h
493	transforms
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2	We usually comment as //===--- filename - description ---===// See `head -n1 llvm/lib/Target/X86/*.cpp`
52	Ctx
89	Can we just use `template <bool IsLoad>`? I think it also can reduce the branch.
100	Not sure how about the arithmetic intrinsics. But at least for load and store intrinsics we can use LLVM intrinsic `llvm.masked.load/store` to reduce the inner loop.
167	Maybe we can just use cast to help to raise the assertion.
224	You can use cast to help to check the failure so that VecA/B/C won't be uninitialized.
230	ditto
232	Should check it is V256I32?
233	ditto
289	eltc?
312	Is it necessary to insert the ResElt to VecC?
341	TileLoadStore
342	Forgot to remove?
344	ditto
388	ditto
392	ditto
llvm/lib/Target/X86/X86LowerAMXType.cpp
333 ↗	(On Diff #325964)	ditto
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
1 ↗	(On Diff #325964)	Better name it amx-low-intrinsics-no-amx-bitcast.ll
13 ↗	(On Diff #325964)	It seems the body block is not necessary
19 ↗	(On Diff #325964)	ditto. The label `TILELOAD_SCALARIZE_COLS_BODY` is not even used.
31 ↗	(On Diff #325964)	I think cols.latch is not necessary either.

yubing added inline comments. Feb 24 2021, 6:50 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	Yes, it is necessary, since you should use the updated eltC (aka Cij) when you are doing the matrix dot product: Cij = Cij + Ai1 * B1j; Cij = Cij + Ai2 * B2j; ...; Cij = Cij + AiK * BKj

pengfei added inline comments. Feb 24 2021, 7:37 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	But you don't need to update both C and D. Something like this pseudo code should be enough: for (k : K) Dij += Aik * Bkj; Dij += Cij

LuoYuanke added inline comments. Feb 27 2021, 4:49 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	Why do we need a template instead of passing a parameter `bool IsLoad`?

pengfei added inline comments. Feb 27 2021, 5:36 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	Bing thought template instantiation can keep the condition code from turning into branch instructions.

LuoYuanke added inline comments. Feb 27 2021, 5:40 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89	It is arguable which benefits more: saving code size or avoiding branch instructions. :)
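
The trade-off in this thread can be sketched outside of LLVM. The example below is an illustration only: `copyElementRuntime` and `copyElement` are hypothetical functions, not the helpers in this patch. The runtime flag keeps one function body with a branch, while the template parameter produces two instantiations in which the condition is a compile-time constant.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Runtime flag: a single function body; the direction is an ordinary branch.
void copyElementRuntime(bool IsLoad, std::vector<int32_t> &Tile,
                        std::vector<int32_t> &Memory, std::size_t I) {
  if (IsLoad)
    Tile[I] = Memory[I];
  else
    Memory[I] = Tile[I];
}

// Compile-time flag: copyElement<true> and copyElement<false> are separate
// instantiations, and the constant condition can be folded away in each.
template <bool IsLoad>
void copyElement(std::vector<int32_t> &Tile, std::vector<int32_t> &Memory,
                 std::size_t I) {
  if (IsLoad)
    Tile[I] = Memory[I];
  else
    Memory[I] = Tile[I];
}
```

Both forms behave the same; the template trades a second copy of the code for the absence of the runtime check, which is the cost the reviewers weigh here.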

yubing added inline comments. Feb 28 2021, 9:09 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
100	I think we can prepare a follow-up patch for this optimization.

address comments above

yubing marked 15 inline comments as done. Mar 1 2021, 11:21 PM

yubing added inline comments.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	I changed the code to the following style, which also reduces the inner loop's size: for (k : K) Cij += Aik * Bkj; Dij = Cij. Besides, I hoisted the calculation of (i,j)'s linear index above the inner loop.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
13 ↗	(On Diff #325964)	In fact, ISEL PASS can merge basicblocks together.

LuoYuanke added inline comments. Mar 2 2021, 1:24 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instruction for vector C.

Harbormaster completed remote builds in B91505: Diff 327362. Mar 2 2021, 2:04 AM

yubing added inline comments. Mar 2 2021, 7:30 PM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312	But your solution still needs to update D, so D's phi will be kept in the inner loops.
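
The two accumulation styles debated for line 312 produce the same result, which a plain scalar version makes easy to see. This is a simplified sketch under stated assumptions: `Matrix`, `accumulateThroughC`, and `accumulateThroughD` are hypothetical names, and each element is treated as a single integer, ignoring the packing of four signed bytes per 32-bit element that tdpbssd operates on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<int32_t>>;

// Style 1: for (k : K) Cij += Aik * Bkj; Dij = Cij
Matrix accumulateThroughC(Matrix C, const Matrix &A, const Matrix &B) {
  const std::size_t K = B.size();
  for (std::size_t i = 0; i < C.size(); ++i)
    for (std::size_t j = 0; j < C[i].size(); ++j)
      for (std::size_t k = 0; k < K; ++k)
        C[i][j] += A[i][k] * B[k][j];
  return C; // D is the updated copy of C
}

// Style 2: for (k : K) Dij += Aik * Bkj; Dij += Cij
Matrix accumulateThroughD(const Matrix &C, const Matrix &A, const Matrix &B) {
  const std::size_t K = B.size();
  Matrix D(C.size());
  for (std::size_t i = 0; i < C.size(); ++i) {
    D[i].assign(C[i].size(), 0);
    for (std::size_t j = 0; j < C[i].size(); ++j) {
      for (std::size_t k = 0; k < K; ++k)
        D[i][j] += A[i][k] * B[k][j];
      D[i][j] += C[i][j]; // add the original accumulator once, after the loop
    }
  }
  return D;
}
```

In both styles exactly one value is carried across iterations of the innermost loop, which matches the remark that a phi for the running sum stays in the inner loop either way.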

LGTM with some nitpicks 😊

llvm/include/llvm/CodeGen/Passes.h
493	transforms
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
10	operations
11	We always enable it. Also need to mention optnone.
47	Curly brackets are not necessary here.
83	Do we need to remove the successor? Isn't it still being dominated?
97	IsTileLoad
118	Use 1 directly?
122	Better use the same naming convention, i.e. `ColLoopHeader`
126–127	Better to change the order, e.g. Type *EltTy = B.getInt32Ty(); FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256);
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
1 ↗	(On Diff #327362)	I think we should move the files to llvm/test/Transforms/
llvm/test/CodeGen/X86/AMX/amx-type.ll
2 ↗	(On Diff #327362)	Why adding this? Is it O2 by default?

This revision is now accepted and ready to land. Mar 3 2021, 5:52 AM

LuoYuanke added inline comments. Mar 4 2021, 5:09 AM

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
83	I think this is to remove edge from preheader to tmp, because we insert a loop between them.
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
1 ↗	(On Diff #327362)	Not sure about it. Our .cpp code is under lib/Target/X86/ folder.
llvm/test/CodeGen/X86/AMX/amx-type.ll
2 ↗	(On Diff #327362)	I think this is to test that this pass does nothing at opt level 2.

Address pengfei's comments

LGTM too.

This revision was landed with ongoing or failed builds. Mar 5 2021, 12:02 AM

Closed by commit rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation. (authored by LuoYuanke, committed by yubing).

This revision was automatically updated to reflect the committed changes.

yubing added a commit: rG8198d83965ba: [X86] Pass to transform amx intrinsics to scalar operation.

This seems to break the build https://buildkite.com/mlir/mlir-core/builds/12026#91ec4dfe-542f-4312-92db-7d555f05ce06.

I could repro locally, reverting locally fixes the build.

Please address, thanks!

RKSimon added a reverting change: rG3fd2fa122059: Revert rG8198d83965ba4b9db6922b44ef3041030b2bac39: "[X86] Pass to transform amx…. Mar 5 2021, 3:09 AM

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

In D93594#2606157, @RKSimon wrote:

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Thanks – I was just about to point out this broke downstream testing too.

Thanks all for reporting and reverting this.

Harbormaster completed remote builds in B92239: Diff 328408. Mar 5 2021, 4:55 PM

Thanks all for reporting and reverting this. I will fix the bug ASAP.

In D93594#2606157, @RKSimon wrote:

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Hi, @RKSimon @nicolasvasilache, it seems we haven't told libLLVMX86CodeGen.so.13git to link TransformUtils in llvm/lib/Target/X86/CMakeLists.txt. That's why we hit the build failure.
But there is a strange thing that can be observed in build.ninja:
When I run cmake with "-DBUILD_SHARED_LIBS=OFF", libLLVMX86CodeGen.a still links lib/libLLVMTransformUtils.a.
When I run cmake with "-DBUILD_SHARED_LIBS=ON", libLLVMX86CodeGen.so.13git doesn't link TransformUtils.
Is there any difference in the build system between static and shared libraries?

yubing reopened this revision. Mar 8 2021, 7:30 PM

This revision is now accepted and ready to land. Mar 8 2021, 7:30 PM

Fix the build failure with -DBUILD_SHARED_LIBS=ON

Harbormaster completed remote builds in B92791: Diff 329204. Mar 9 2021, 4:01 AM

This revision was landed with ongoing or failed builds. Mar 15 2021, 7:41 PM

Closed by commit rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation. (authored by yubing).

This revision was automatically updated to reflect the committed changes.

yubing added a commit: rG4f198b0c27b0: [X86] Pass to transform amx intrinsics to scalar operation.

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).
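
The suggestion above amounts to building the expensive analyses lazily. The sketch below shows that pattern in isolation and is not the fix that was eventually written: `Analysis`, `isAmxIntrinsic`, and `lowerFunction` are hypothetical stand-ins, instructions are modeled as strings, and the `"amx."` prefix is an assumption made only for this example.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an expensive analysis such as a dominator tree.
// The counter records how many times it was built.
struct Analysis {
  explicit Analysis(int &BuildCounter) { ++BuildCounter; }
};

// Hypothetical check: an instruction is an AMX intrinsic if it starts with "amx.".
bool isAmxIntrinsic(const std::string &Op) { return Op.rfind("amx.", 0) == 0; }

// Rewrite AMX intrinsics and return true if anything changed. The analysis is
// constructed only when the first intrinsic is found, so functions without
// AMX intrinsics never pay for it.
bool lowerFunction(std::vector<std::string> &Insts, int &AnalysisBuilds) {
  std::unique_ptr<Analysis> Info; // built on demand
  bool Changed = false;
  for (std::string &Op : Insts) {
    if (!isAmxIntrinsic(Op))
      continue;
    if (!Info)
      Info = std::make_unique<Analysis>(AnalysisBuilds);
    Op = "scalar." + Op.substr(4); // placeholder for the real lowering
    Changed = true;
  }
  return Changed;
}
```

With this shape, a function that contains no AMX intrinsics only costs one scan over its instructions.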

In D93594#2628497, @nikic wrote:

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).

Thanks, @nikic, I will fix it ASAP. Besides, how can I reproduce the regression?
I am asking because I want to check that the regression can no longer be reproduced with my upcoming fix.

@yubing In this case I would recommend building sqlite3.c from test-suite under perf stat and look at the instructions metric. For me the command looks like this:

perf stat CLANG_BINARY   -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -I. -MD -MT MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -MF MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o.d -o MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o   -c ../MultiSource/Applications/sqlite3/sqlite3.c

You can generally get a build command using ninja -v sqlite3 in test-suite.

I can reproduce the regression. I'll help to fix it.

LuoYuanke mentioned this in D98773: [X86] Fix compile time regression of D93594. Mar 17 2021, 4:24 AM

The fix is uploaded at https://reviews.llvm.org/D98773.

davezarzycki removed a subscriber: davezarzycki. Mar 17 2021, 6:46 AM

LuoYuanke mentioned this in rGe64adc0b88c2: [X86] Fix compile time regression of D93594. Mar 18 2021, 1:53 AM

yubing mentioned this in rG113f077f808f: [X86] Pass to transform tdpbf16ps intrinsics to scalar operation. Mar 21 2021, 10:01 PM

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/Passes.h	2 lines
llvm/lib/Target/X86/CMakeLists.txt	1 line
llvm/lib/Target/X86/X86.h	1 line
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp	383 lines
llvm/lib/Target/X86/X86TargetMachine.cpp	8 lines
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll	118 lines
llvm/test/CodeGen/X86/O0-pipeline.ll	4 lines

Diff 313084

llvm/include/llvm/CodeGen/Passes.h

(483 preceding lines not shown; context: /// MachineDominanaceFrontier - This pass is a machine dominators analysis pass.)

  /// The pass fixups statepoint machine instruction to replace usage of
  /// caller saved registers with stack slots.
  extern char &FixupStatepointCallerSavedID;

  /// The pass transform load/store <256 x i32> to AMX load/store intrinsics
  /// or split the data to two <128 x i32>.
  FunctionPass *createX86LowerAMXTypePass();

+ FunctionPass *createX86LowerAMXIntrinsicsPass();
  } // End llvm namespace

  #endif

llvm/lib/Target/X86/CMakeLists.txt

(27 preceding lines not shown)
  set(sources
  X86AvoidTrailingCall.cpp
  X86CallFrameOptimization.cpp
  X86CallingConv.cpp
  X86CallLowering.cpp
  X86CmovConversion.cpp
  X86DomainReassignment.cpp
  X86DiscriminateMemOps.cpp
  X86LowerAMXType.cpp
+ X86LowerAMXIntrinsics.cpp
  X86TileConfig.cpp
  X86PreTileConfig.cpp
  X86ExpandPseudo.cpp
  X86FastISel.cpp
  X86FixupBWInsts.cpp
  X86FixupLEAs.cpp
  X86AvoidStoreForwardingBlocks.cpp
  X86FixupSetCC.cpp
(64 following lines not shown)

llvm/lib/Target/X86/X86.h

(163 preceding lines not shown)
  void initializeX86LoadValueInjectionRetHardeningPassPass(PassRegistry &);
  void initializeX86OptimizeLEAPassPass(PassRegistry &);
  void initializeX86PartialReductionPass(PassRegistry &);
  void initializeX86SpeculativeLoadHardeningPassPass(PassRegistry &);
  void initializeX86SpeculativeExecutionSideEffectSuppressionPass(PassRegistry &);
  void initializeX86PreTileConfigPass(PassRegistry &);
  void initializeX86TileConfigPass(PassRegistry &);
  void initializeX86LowerAMXTypeLegacyPassPass(PassRegistry &);
+ void initializeX86LowerAMXIntrinsicsLegacyPassPass(PassRegistry &);

  namespace X86AS {
  enum : unsigned {
  GS = 256,
  FS = 257,
  SS = 258,
  PTR32_SPTR = 270,
  PTR32_UPTR = 271,
  PTR64 = 272
  };
  } // End X86AS namespace

  } // End llvm namespace

  #endif

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp

This file was added.

				//===- llvm/CodeGen/TileShapeInfo.h - ---------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file Pass to transform amx intrinsics to scalar operation.
				/// This pass is only enabled with -O0. With -O0, the def of shape to amx
				/// intrinsics is near the amx intrinsics code. We are not bale to find a
				/// point which post-dominate all the shape and dominate all amx intrinsics.
				/// To decouple the dependency of the shape, we transform amx intrinsics
				/// to scalar operation, so that compiling doesn't fail. In long term, we
				/// should improve fast register allocation to allocate amx register.
				//===----------------------------------------------------------------------===//
				//
				#include "X86.h"
				#include "llvm/ADT/DenseSet.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/Analysis/DomTreeUpdater.h"
				#include "llvm/Analysis/OptimizationRemarkEmitter.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/CodeGen/ValueTypes.h"
				#include "llvm/IR/DataLayout.h"
				#include "llvm/IR/Function.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/IntrinsicsX86.h"
				#include "llvm/IR/PatternMatch.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Pass.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/LoopUtils.h"

				using namespace llvm;
				using namespace PatternMatch;

				#define DEBUG_TYPE "lower-amx-intrinsics"

				static BasicBlock *createLoop(BasicBlock *Preheader, BasicBlock *Exit,
				Value *Bound, Value *Step, StringRef Name,
				IRBuilderBase &B, DomTreeUpdater &DTU, Loop *L,
				LoopInfo &LI) {
				LLVMContext &Ctx = Preheader->getContext();
				BasicBlock *Header = BasicBlock::Create(
				Preheader->getContext(), Name + ".header", Preheader->getParent(), Exit);
				BasicBlock *Body = BasicBlock::Create(Header->getContext(), Name + ".body",
				Header->getParent(), Exit);
				BasicBlock *Latch = BasicBlock::Create(Header->getContext(), Name + ".latch",
				Header->getParent(), Exit);

				Type *I16Ty = Type::getInt16Ty(Ctx);
				BranchInst::Create(Body, Header);
				BranchInst::Create(Latch, Body);
				PHINode *IV =
				PHINode::Create(I16Ty, 2, Name + ".iv", Header->getTerminator());
				IV->addIncoming(ConstantInt::get(I16Ty, 0), Preheader);

				B.SetInsertPoint(Latch);
				Value *Inc = B.CreateAdd(IV, Step, Name + ".step");
				Value *Cond = B.CreateICmpNE(Inc, Bound, Name + ".cond");
				BranchInst::Create(Header, Exit, Cond, Latch);
				IV->addIncoming(Inc, Latch);

				BranchInst *PreheaderBr = cast<BranchInst>(Preheader->getTerminator());
				BasicBlock *Tmp = PreheaderBr->getSuccessor(0);
				PreheaderBr->setSuccessor(0, Header);
				DTU.applyUpdatesPermissive({
				{DominatorTree::Delete, Preheader, Tmp},
				{DominatorTree::Insert, Header, Body},
				{DominatorTree::Insert, Body, Latch},
				{DominatorTree::Insert, Latch, Header},
				{DominatorTree::Insert, Latch, Exit},
				{DominatorTree::Insert, Preheader, Header},
				});

				L->addBasicBlockToLoop(Header, LI);
				L->addBasicBlockToLoop(Body, LI);
				L->addBasicBlockToLoop(Latch, LI);
				return Body;
				}

				static Value *createTileLoadLoops(BasicBlock *Start, BasicBlock *End,
				IRBuilderBase &B, DomTreeUpdater &DTU,
				LoopInfo &LI, Value *Row, Value *Col,
				Value *Ptr, Value *Stride) {
				Loop *RowLoop = LI.AllocateLoop();
				Loop *ColLoop = LI.AllocateLoop();
				RowLoop->addChildLoop(ColLoop);
				if (Loop *ParentL = LI.getLoopFor(Start))
				ParentL->addChildLoop(RowLoop);
				else
				LI.addTopLevelLoop(RowLoop);

				BasicBlock *RowBody =
				createLoop(Start, End, Row, B.getInt16(1), "rows", B, DTU, RowLoop, LI);
				BasicBlock *RowLatch = RowBody->getSingleSuccessor();

				uint16_t ColStep = B.getInt32Ty()->getPrimitiveSizeInBits() / 8;
				BasicBlock *ColBody = createLoop(RowBody, RowLatch, Col, B.getInt16(ColStep),
				"cols", B, DTU, ColLoop, LI);

				BasicBlock *ColLoopLatch = ColBody->getSingleSuccessor();
				BasicBlock *ColumnLoopHeader = ColBody->getSinglePredecessor();
				BasicBlock *RowLoopHeader = RowBody->getSinglePredecessor();
				Value *CurrentRow = &*RowLoopHeader->begin();
				Value *CurrentCol = &*ColumnLoopHeader->begin();

				// cols.header:
				// %vecphi = phi [%undef, %rows.body] [%vec2, %cols.latch]
				B.SetInsertPoint(ColumnLoopHeader->getTerminator());
				FixedVectorType *V256I32Ty = FixedVectorType::get(B.getInt32Ty(), 256);
				Value *UndefVec = UndefValue::get(V256I32Ty);
				PHINode *VecPhi = B.CreatePHI(V256I32Ty, 2, "vec.phi");
				VecPhi->addIncoming(UndefVec, RowBody);

				// cols.body:
				// %elt = load i32 i32 *ptr
				// %mul = mul i16 %row.iv, i16 16
				// %add = add i16 %mul, i16 %col.iv
				// %vec2 = insertelement <16 x i32> %vecphi, i32 %elt, i16 %idx
				B.SetInsertPoint(ColBody->getTerminator());
				Type *EltTy = V256I32Ty->getElementType();
				Value *CurrentRowZExt = B.CreateZExt(CurrentRow, Stride->getType());
				Value *CurrentColZExt = B.CreateZExt(CurrentCol, Stride->getType());
				Value *Offset =
				B.CreateAdd(B.CreateMul(CurrentRowZExt, Stride), CurrentColZExt);
				unsigned AS = cast<PointerType>(Ptr->getType())->getAddressSpace();
				Value *EltBasePtr = B.CreatePointerCast(Ptr, PointerType::get(EltTy, AS));
				Value *EltPtr = B.CreateGEP(EltTy, EltBasePtr, Offset);
				Value *Elt = B.CreateLoad(EltTy, EltPtr);
				Value *Idx = B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentCol);
				Value *ResVec = B.CreateInsertElement(VecPhi, Elt, Idx);
				VecPhi->addIncoming(ResVec, ColLoopLatch);

				return ResVec;
				}

				static Value *createTileDPBSSDLoops(BasicBlock *Start, BasicBlock *End,
				IRBuilderBase &B, DomTreeUpdater &DTU,
				LoopInfo &LI, Value *Row, Value *Col,
				Value *K, Value *Acc, Value *LHS,
				Value *RHS) {
				Loop *RowLoop = LI.AllocateLoop();
				Loop *ColLoop = LI.AllocateLoop();
				Loop *InnerLoop = LI.AllocateLoop();
				ColLoop->addChildLoop(InnerLoop);
				RowLoop->addChildLoop(ColLoop);
				if (Loop *ParentL = LI.getLoopFor(Start))
				ParentL->addChildLoop(RowLoop);
				else
				LI.addTopLevelLoop(RowLoop);

				BasicBlock *RowBody =
				createLoop(Start, End, Row, B.getInt16(1), "rows", B, DTU, RowLoop, LI);
				BasicBlock *RowLatch = RowBody->getSingleSuccessor();

				LuoYuanke (Done): Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops, so that it can be used by both and some other functions.
				BasicBlock *ColBody = createLoop(RowBody, RowLatch, Col, B.getInt16(1),
				"cols", B, DTU, ColLoop, LI);
				BasicBlock *ColLoopLatch = ColBody->getSingleSuccessor();

				uint16_t KStep = B.getInt32Ty()->getPrimitiveSizeInBits() / 8;
				B.SetInsertPoint(ColBody->getTerminator());
				Value *BoundK = B.CreateUDiv(K, B.getInt16(KStep));
				pengfei (Done): Maybe we can just use `cast` to help raise the assertion.
				BasicBlock *InnerBody =
				createLoop(ColBody, ColLoopLatch, BoundK, B.getInt16(1), "inner", B, DTU,
				InnerLoop, LI);

				BasicBlock *ColumnLoopHeader = ColBody->getSinglePredecessor();
				BasicBlock *RowLoopHeader = RowBody->getSinglePredecessor();
				BasicBlock *InnerLoopHeader = InnerBody->getSinglePredecessor();
				BasicBlock *InnerLoopLatch = InnerBody->getSingleSuccessor();
				Value *CurrentRow = &*RowLoopHeader->begin();
				Value *CurrentCol = &*ColumnLoopHeader->begin();
				Value *CurrentInner = &*InnerLoopHeader->begin();

				FixedVectorType *V256I32Ty = FixedVectorType::get(B.getInt32Ty(), 256);
				Type *EltTy = V256I32Ty->getElementType();
				Value *VecC, *VecA, *VecB;
				if (auto BitCast = dyn_cast<BitCastInst>(Acc))
				Lint (Pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]
				Lint (Pre-merge checks): clang-tidy: warning: variable 'VecC' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecC = BitCast->getOperand(0);
				assert(VecC->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
				// TODO else create BitCast from x86amx to v256i32.
				// Store x86amx to memory, and reload from memory
				// to vector. However with -O0, it doesn't happen.
				if (auto BitCast = dyn_cast<BitCastInst>(LHS))
				Lint (Pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]
				Lint (Pre-merge checks): clang-tidy: warning: variable 'VecA' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecA = BitCast->getOperand(0);
				assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
				if (auto BitCast = dyn_cast<BitCastInst>(RHS))
				Lint (Pre-merge checks): clang-tidy: warning: 'auto BitCast' can be declared as 'auto *BitCast' [llvm-qualified-auto]
				Lint (Pre-merge checks): clang-tidy: warning: variable 'VecB' is used uninitialized whenever 'if' condition is false [clang-diagnostic-sometimes-uninitialized]
				VecB = BitCast->getOperand(0);
				assert(VecB->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");

				// Generate PHI vector for C.
				B.SetInsertPoint(InnerLoopHeader->getTerminator());
				PHINode *VecCPhi = B.CreatePHI(V256I32Ty, 2, "vec.phi");
				VecCPhi->addIncoming(VecC, ColBody);

				// Generate accmulate multiply in innerbody.
				B.SetInsertPoint(InnerBody->getTerminator());
				Value *IdxC =
				B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentCol);
				Value *IdxA =
				B.CreateAdd(B.CreateMul(CurrentRow, B.getInt16(16)), CurrentInner);
				Value *IdxB =
				B.CreateAdd(B.CreateMul(CurrentInner, B.getInt16(16)), CurrentCol);

				FixedVectorType *V4I8Ty = FixedVectorType::get(B.getInt8Ty(), 4);
				Value *EltC = B.CreateExtractElement(VecA, IdxC);
				Value *SubVecC = B.CreateBitCast(EltC, V4I8Ty);
				Value *EltA = B.CreateExtractElement(VecA, IdxA);
				xiangzhangllvm (Not Done): In fact, there is no need to handle Row, Col and K here. Just use the fixed size 16x16; the result of the calculation is the same in the effective area (tileload just needs to keep the "unused" area at 0). Then we can use vectors to handle all of them and let type legalization split the type.
				yubing (Author, Done): We should keep the code here. In bf16, +0.0 (0x0000) multiplied by a negative float equals -0.0 (0x8000), so following your solution cannot ensure that the outer edge is all zero.
				Value *SubVecA = B.CreateBitCast(EltA, V4I8Ty);
				Value *EltB = B.CreateExtractElement(VecA, IdxB);
				Value *SubVecB = B.CreateBitCast(EltB, V4I8Ty);
				Value *SubVecR = B.CreateAdd(B.CreateMul(SubVecA, SubVecB), SubVecC);
				Value *ResElt = B.CreateBitCast(SubVecR, EltTy);
				Value *NewVecC = B.CreateInsertElement(VecC, ResElt, IdxC);
				VecCPhi->addIncoming(NewVecC, InnerLoopLatch);

				return NewVecC;
				}
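
yubing's reply above rests on an IEEE 754 property: the sign of a product is the XOR of the operand signs, so +0.0 times a negative value gives -0.0, whose bit pattern is not all zero. A minimal sketch in standard C++ follows. It uses `float` as a stand-in for bf16 (an assumption: bf16 is the upper 16 bits of a binary32 value, so the sign bit behaves the same way) and assumes the compiler is not run in a fast-math mode. `bitsOf` and `paddedProduct` are hypothetical helpers, not part of the patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw binary32 bit pattern of a float.
static uint32_t bitsOf(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

// Hypothetical helper: multiply a zero-padded element (+0.0) by an operand.
static float paddedProduct(float Operand) {
  // volatile keeps the multiplication from being simplified away.
  volatile float Zero = 0.0f;
  return Zero * Operand;
}
```

With these helpers, `bitsOf(paddedProduct(-3.0f)) >> 16` is the bf16-sized upper half of the result and has only the sign bit set, while a positive operand leaves all bits clear.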

				pengfei (Done): You can use `cast` to help check the failure so that VecA/B/C won't be uninitialized.
				namespace {
				class X86LowerAMXIntrinsics {
				Function &Func;

				public:
				X86LowerAMXIntrinsics(Function &F, DominatorTree *DT, LoopInfo *LI)
				pengfei (Done): ditto
				: Func(F), DT(DT), LI(LI) {}
				bool visit();
				pengfei (Done): Should check it is V256I32?

				pengfei (Done): ditto
				private:
				DominatorTree *DT;
				LoopInfo *LI;
				bool lowerTileLoad(Instruction *TileLoad);
				bool lowerTileDPBSSD(Instruction *TileDPBSSD);
				};

				bool X86LowerAMXIntrinsics::lowerTileDPBSSD(Instruction *TileDPBSSD) {
				Value *M, *N, *K, *C, *A, *B;
				match(TileDPBSSD, m_Intrinsic<Intrinsic::x86_tdpbssd_internal>(
				m_Value(M), m_Value(N), m_Value(K), m_Value(C),
				m_Value(A), m_Value(B)));
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				Instruction *InsertI = TileDPBSSD;
				BasicBlock *Start = InsertI->getParent();
				BasicBlock *End =
				SplitBlock(InsertI->getParent(), InsertI, DT, LI, nullptr, "continue");
				IRBuilder<> Builder(TileDPBSSD);
				LuoYuanke (Done): Delete the dead code.
				Value *ResVec =
				createTileDPBSSDLoops(Start, End, Builder, DTU, *LI, M, N, K, C, A, B);

				// Delete tileloadd6 intrinsic and bitcast instruction.
				for (auto UI = TileDPBSSD->use_begin(), UE = TileDPBSSD->use_end();
				UI != UE;) {
				Instruction *I = cast<Instruction>((UI++)->getUser());
				Value *Vec;
				if (match(I, m_BitCast(m_Value(Vec)))) {
				I->replaceAllUsesWith(ResVec);
				I->eraseFromParent();
				}
				}
				TileDPBSSD->eraseFromParent();
				return true;
				}

				LuoYuanke (Done): It should be in another line.
				bool X86LowerAMXIntrinsics::lowerTileLoad(Instruction *TileLoad) {
				Value *M, *N, *Ptr, *Stride;
				match(TileLoad, m_Intrinsic<Intrinsic::x86_tileloadd64_internal>(
				m_Value(M), m_Value(N), m_Value(Ptr), m_Value(Stride)));
				DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Lazy);
				Instruction *InsertI = TileLoad;
				BasicBlock *Start = InsertI->getParent();
				BasicBlock *End =
				SplitBlock(InsertI->getParent(), InsertI, DT, LI, nullptr, "continue");
				IRBuilder<> Builder(TileLoad);
				Value *ResVec =
				createTileLoadLoops(Start, End, Builder, DTU, *LI, M, N, Ptr, Stride);

				LuoYuanke (Done): Better to be in a new line.
				// Delete tileloadd6 intrinsic and bitcast instruction.
				for (auto UI = TileLoad->use_begin(), UE = TileLoad->use_end(); UI != UE;) {
				Instruction *I = cast<Instruction>((UI++)->getUser());
				Value *Vec;
				if (match(I, m_BitCast(m_Value(Vec)))) {
				I->replaceAllUsesWith(ResVec);
				I->eraseFromParent();
				}
				pengfei (Not Done): eltc?
				}
				TileLoad->eraseFromParent();
				return true;
				}
				LuoYuanke (Done): Better to be in a new line.

				bool X86LowerAMXIntrinsics::visit() {
				bool C;
				SmallVector<Instruction *, 8> TileDPBSSDs;
				SmallVector<Instruction *, 8> TileLoads;
				SmallVector<Instruction *, 8> TileStores;

				for (BasicBlock *BB : post_order(&Func)) {
				for (BasicBlock::reverse_iterator II = BB->rbegin(), IE = BB->rend();
				II != IE;) {
				Instruction &Inst = *II++;
				if (match(&Inst, m_Intrinsic<Intrinsic::x86_tdpbssd_internal>())) {
				// %amx1 = bitcast <256 x i32> %vec to x86_amx
				// %res = call x86_amx @llvm.x86.tdpbssd.internal(i16 m, i16 n, i16 k,
				// x86_amx, %amx1, ...)
				// %vec2 = bitcast x86_amx %res to <256 x i32>
				TileDPBSSDs.push_back(&Inst);
				} else if (match(&Inst,
				m_Intrinsic<Intrinsic::x86_tileloadd64_internal>())) {
				pengfei (Not Done): Is it necessary to insert the ResElt to VecC?
				yubing (Author, Done): Yes, it is necessary, since you should use the updated eltC (i.e. Cij) when computing the matrix dot product:
				Cij = Cij + Ai1.B1j
				Cij = Cij + Ai2.B2j
				....
				Cij = Cij + AiK.BKj
				pengfei (Not Done): But you don't need to update both C and D. Something like this pseudocode should be enough:
				```
				for (k : K)
				  Dij += Aik * Bkj;
				Dij += Cij
				```
				yubing (Author, Done): I changed the code into the following style, which also reduces the inner loop's size:
				```
				for (k : K)
				  Cij += Aik * Bkj;
				Dij = Cij
				```
				Besides, I hoisted the calculation of (i, j)'s linear index above the inner loop.
				LuoYuanke (Not Done): It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instructions for vector C.
				yubing (Author, Done): But your solution still needs to update D, so D's phi will be kept in the inner loop.
				// %17 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %13, i16 %14,
				// i8* %15, i64 %16)
				// %18 = bitcast x86_amx %17 to <256 x i32>
				TileLoads.push_back(&Inst);
				} else if (match(&Inst,
				m_Intrinsic<Intrinsic::x86_tilestored64_internal>())) {
				// %89 = bitcast <256 x i32> %88 to x86_amx
				// call void @llvm.x86.tilestored64.internal(i16 %84, i16 %85, i8* %86,
				// i64 %87, x86_amx %89)
				// lowerTileStore();
				TileStores.push_back(&Inst);
				}
				}
				}

				for (auto *Inst : TileLoads) {
				C |= lowerTileLoad(Inst);
				Lint (Pre-merge checks): clang-tidy: warning: variable 'C' is uninitialized when used here [clang-diagnostic-uninitialized]
				}
				for (auto *Inst : TileDPBSSDs) {
				C |= lowerTileDPBSSD(Inst);
				}

				return C;
				}
				} // anonymous namespace
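
The review thread above settles on accumulating `Cij += Aik * Bkj` over k. For comparison, a plain C++ reference model of that accumulation for signed bytes might look like the sketch below. It assumes the layout implied by the index arithmetic in the diff (tiles of 32-bit elements with a fixed row pitch of 16, so element (i, j) sits at `i * 16 + j`) and assumes each 32-bit element packs four signed bytes whose products are summed into a 32-bit accumulator. `TileVec`, `byteOf` and `tdpbssdRef` are hypothetical names; this is not the code under review.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using TileVec = std::array<int32_t, 256>; // 16 x 16 elements, row pitch 16

// Extract byte Idx (0 = least significant) of a packed element as a signed value.
static int32_t byteOf(int32_t Elt, unsigned Idx) {
  uint32_t Raw = (static_cast<uint32_t>(Elt) >> (8 * Idx)) & 0xFFu;
  return Raw < 128 ? static_cast<int32_t>(Raw) : static_cast<int32_t>(Raw) - 256;
}

// Reference accumulation: C[i][j] += sum over k and byte lanes of A[i][k] * B[k][j].
// Rows and Cols count 32-bit elements of C; KBytes is the byte width of a row of A.
static void tdpbssdRef(TileVec &C, const TileVec &A, const TileVec &B,
                       unsigned Rows, unsigned Cols, unsigned KBytes) {
  assert(Rows <= 16 && Cols <= 16 && KBytes <= 64 && KBytes % 4 == 0);
  const unsigned KElts = KBytes / 4; // same bound as the udiv by 4 in the diff
  for (unsigned I = 0; I < Rows; ++I)
    for (unsigned J = 0; J < Cols; ++J) {
      int32_t Acc = C[I * 16 + J]; // start from the current Cij
      for (unsigned K = 0; K < KElts; ++K)
        for (unsigned Lane = 0; Lane < 4; ++Lane)
          Acc += byteOf(A[I * 16 + K], Lane) * byteOf(B[K * 16 + J], Lane);
      C[I * 16 + J] = Acc; // write Cij back once per (i, j)
    }
}
```

Keeping the running sum in a local and writing it back once per (i, j) mirrors the point made in the thread: only one value needs to be carried through the inner loop.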

				namespace {

				class X86LowerAMXIntrinsicsLegacyPass : public FunctionPass {
				pengfei (Done): TileLoadStore
				public:
				pengfei (Done): Forgot to remove?
				static char ID;

				pengfei (Done): ditto
				X86LowerAMXIntrinsicsLegacyPass() : FunctionPass(ID) {
				initializeX86LowerAMXIntrinsicsLegacyPassPass(
				*PassRegistry::getPassRegistry());
				}

				bool runOnFunction(Function &F) override {
				if (!F.hasFnAttribute(Attribute::OptimizeNone))
				return false;

				auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();

				X86LowerAMXIntrinsics LAT(F, &DT, &LI);
				xiangzhangllvm (Not Done): I see you need to force-match the bitcast and then replace it; add an assert for the no-bitcast case.
				bool C = LAT.visit();
				return C;
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.addPreserved<DominatorTreeWrapperPass>();
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addPreserved<LoopInfoWrapperPass>();
				}
				};

				} // anonymous namespace

				static const char PassName[] = "Lower AMX intrinsics";
				char X86LowerAMXIntrinsicsLegacyPass::ID = 0;
				LuoYuanke (Done): The name does not seem good. Is "PreBuilder" better? And why do we need two builders in the function?
				INITIALIZE_PASS_BEGIN(X86LowerAMXIntrinsicsLegacyPass, DEBUG_TYPE, PassName,
				false, false)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_END(X86LowerAMXIntrinsicsLegacyPass, DEBUG_TYPE, PassName,
				LuoYuanke (Done): Maybe use a right-shift instruction, which is more efficient. Not sure whether a following pass can optimize the operation.
				false, false)

				FunctionPass *llvm::createX86LowerAMXIntrinsicsPass() {
				return new X86LowerAMXIntrinsicsLegacyPass();
				}
				xiangzhangllvm (Done): '|' is bitwise or; use logical '||'.
				pengfei (Done): You can just return it by `return LAT.visit()`.
				pengfei (Done): `bool C = false`
				pengfei (Done): We can use a forward order to iterate it. Besides, we cannot assume there will always be a bitcast after e.g. x86_tileloadd64_internal. So we need to insert one bitcast as required.
				pengfei (Done): Remove the `{}` for single line loop.
				LuoYuanke (Done): Is "PreBuilder" better?
				LuoYuanke (Done): Shift?
				LuoYuanke (Not Done): PreBuilder?
				LuoYuanke (Not Done): Do we iterate the instructions in topological order or in post order?
				pengfei (Done): It would be better to use
				```
				if (auto *Inst = dyn_cast<IntrinsicInst>&*II++)
				switch (Inst->getIntrinsicID()) {
				case Intrinsic::x86_tdpbssd_internal:
				...
				```
				pengfei (Done): ditto
				pengfei (Done): ditto
				pengfei (Done): ditto
				yubing (Author, Done): It should be pre-order since we need to handle cases without bitcasts, such as amx-low-intrinsics-no-bitcast.ll.
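
One detail behind the `|` versus `||` comment is worth spelling out: when a "changed" flag is accumulated across calls that have side effects, the position of the call relative to a short-circuiting `||` decides whether every call still runs. The sketch below illustrates this with hypothetical stand-ins (`lowerOne` and the three `lowerAll*` functions are illustrative names, not functions from the patch).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a lowering routine: records the call and reports a change.
static bool lowerOne(int Id, std::vector<int> &Visited) {
  Visited.push_back(Id);
  return true;
}

// Bitwise-or assignment: the right-hand side is always evaluated.
static bool lowerAllBitwise(const std::vector<int> &Items, std::vector<int> &Visited) {
  bool Changed = false;
  for (int Id : Items)
    Changed |= lowerOne(Id, Visited);
  return Changed;
}

// Logical-or with the call on the left: the call is still always evaluated.
static bool lowerAllLogical(const std::vector<int> &Items, std::vector<int> &Visited) {
  bool Changed = false;
  for (int Id : Items)
    Changed = lowerOne(Id, Visited) || Changed;
  return Changed;
}

// Logical-or with the flag on the left: calls are skipped once Changed is true.
static bool lowerAllShortCircuit(const std::vector<int> &Items, std::vector<int> &Visited) {
  bool Changed = false;
  for (int Id : Items)
    Changed = Changed || lowerOne(Id, Visited);
  return Changed;
}
```

All three return the same flag, but only the first two visit every item, so a rewrite to `||` needs the call on the left-hand side if each intrinsic must still be lowered.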

llvm/lib/Target/X86/X86TargetMachine.cpp

[56 lines not shown; context: static cl::opt<bool> EnableMachineCombinerPass("x86-machine-combiner",]
 cl::init(true), cl::Hidden);

 extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeX86Target() {
 // Register the target.
 RegisterTargetMachine<X86TargetMachine> X(getTheX86_32Target());
 RegisterTargetMachine<X86TargetMachine> Y(getTheX86_64Target());

 PassRegistry &PR = *PassRegistry::getPassRegistry();
+initializeX86LowerAMXIntrinsicsLegacyPassPass(PR);
 initializeX86LowerAMXTypeLegacyPassPass(PR);
 initializeGlobalISel(PR);
 initializeWinEHStatePassPass(PR);
 initializeFixupBWInstPassPass(PR);
 initializeEvexToVexInstPassPass(PR);
 initializeFixupLEAPassPass(PR);
 initializeFPSPass(PR);
 initializeX86FixupSetCCPassPass(PR);
[332 lines not shown; context: INITIALIZE_PASS_END(X86ExecutionDomainFix, "x86-execution-domain-fix",]
 "X86 Execution Domain Fix", false, false)

 TargetPassConfig *X86TargetMachine::createPassConfig(PassManagerBase &PM) {
 return new X86PassConfig(*this, PM);
 }

 void X86PassConfig::addIRPasses() {
 addPass(createAtomicExpandPass());

+if (TM->getOptLevel() == CodeGenOpt::None)
craig.topper (Not Done): I don't think you can detect O0 this way. A function can have the optnone attribute in the non-O0 pipeline and won't be optimized by the middle end. This can occur if you mix an O0 translation unit and an O3 translation unit in LTO and use O3 for the LTO pipeline.
LuoYuanke (Done): @craig.topper, thank you! How about checking the "optnone" attribute in X86LowerAMXIntrinsicsPass to determine whether the amx intrinsics in the function need to be scalarized?
+addPass(createX86LowerAMXIntrinsicsPass());
LuoYuanke (Done): We may add both passes anyway and, inside each of the two passes, skip it based on the optimization level and the function attribute.
+else {
 addPass(createX86LowerAMXTypePass());
+}

 TargetPassConfig::addIRPasses();

 if (TM->getOptLevel() != CodeGenOpt::None) {
 addPass(createInterleavedAccessPass());
 addPass(createX86PartialReductionPass());
 }

[163 lines not shown]
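
The thread above and the revision summary converge on one rule: the scalarizing pass is skipped only when the pipeline is not O0 and the function lacks the optnone attribute. The sketch below models that decision in plain C++ so the truth table can be checked without LLVM. `OptLevel`, `FunctionInfo` and `shouldScalarizeAMX` are hypothetical stand-ins for `CodeGenOpt::Level`, `Function` and the check inside the pass; they are not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for the code generation optimization level.
enum class OptLevel { None, Less, Default, Aggressive };

// Hypothetical stand-in for the per-function state the pass would query.
struct FunctionInfo {
  bool HasOptNone; // models F.hasFnAttribute(Attribute::OptimizeNone)
};

// Scalarize AMX intrinsics when the pipeline is O0 or the function is optnone.
static bool shouldScalarizeAMX(OptLevel PipelineLevel, const FunctionInfo &F) {
  return PipelineLevel == OptLevel::None || F.HasOptNone;
}
```

The case craig.topper describes, an optnone function compiled in an O3 LTO pipeline, corresponds to `shouldScalarizeAMX(OptLevel::Aggressive, FunctionInfo{true})`, which a check on the pipeline level alone would miss.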

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -lower-amx-intrinsics %s -S | FileCheck %s

				define dso_local void @test_amx_load(i16 signext %row, i16 signext %col, i8* %ptr, i64 %stride, <256 x i32>* %vptr) #0 {
				; CHECK-LABEL: @test_amx_load(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[ROWS_HEADER:%.*]]
				; CHECK: rows.header:
				; CHECK-NEXT: [[ROWS_IV:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[ROWS_STEP:%.*]], [[ROWS_LATCH:%.*]] ]
				; CHECK-NEXT: br label [[ROWS_BODY:%.*]]
				; CHECK: rows.body:
				; CHECK-NEXT: br label [[COLS_HEADER:%.*]]
				; CHECK: cols.header:
				; CHECK-NEXT: [[COLS_IV:%.*]] = phi i16 [ 0, [[ROWS_BODY]] ], [ [[COLS_STEP:%.*]], [[COLS_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <256 x i32> [ undef, [[ROWS_BODY]] ], [ [[TMP9:%.*]], [[COLS_LATCH]] ]
				; CHECK-NEXT: br label [[COLS_BODY:%.*]]
				; CHECK: cols.body:
				; CHECK-NEXT: [[TMP0:%.*]] = zext i16 [[ROWS_IV]] to i64
				; CHECK-NEXT: [[TMP1:%.*]] = zext i16 [[COLS_IV]] to i64
				; CHECK-NEXT: [[TMP2:%.*]] = mul i64 [[TMP0]], [[STRIDE:%.*]]
				; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[TMP2]], [[TMP1]]
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[PTR:%.*]] to i32*
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i32, i32* [[TMP4]], i64 [[TMP3]]
				; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP5]], align 4
				; CHECK-NEXT: [[TMP7:%.*]] = mul i16 [[ROWS_IV]], 16
				; CHECK-NEXT: [[TMP8:%.*]] = add i16 [[TMP7]], [[COLS_IV]]
				; CHECK-NEXT: [[TMP9]] = insertelement <256 x i32> [[VEC_PHI]], i32 [[TMP6]], i16 [[TMP8]]
				; CHECK-NEXT: br label [[COLS_LATCH]]
				; CHECK: cols.latch:
				; CHECK-NEXT: [[COLS_STEP]] = add i16 [[COLS_IV]], 4
				; CHECK-NEXT: [[COLS_COND:%.*]] = icmp ne i16 [[COLS_STEP]], [[COL:%.*]]
				; CHECK-NEXT: br i1 [[COLS_COND]], label [[COLS_HEADER]], label [[ROWS_LATCH]]
				; CHECK: rows.latch:
				; CHECK-NEXT: [[ROWS_STEP]] = add i16 [[ROWS_IV]], 1
				; CHECK-NEXT: [[ROWS_COND:%.*]] = icmp ne i16 [[ROWS_STEP]], [[ROW:%.*]]
				; CHECK-NEXT: br i1 [[ROWS_COND]], label [[ROWS_HEADER]], label [[CONTINUE:%.*]]
				; CHECK: continue:
				; CHECK-NEXT: store <256 x i32> [[TMP9]], <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %ptr, i64 %stride)
				%vec = bitcast x86_amx %amx to <256 x i32>
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}

				define dso_local void @test_amx_dp(i16 signext %row, i16 signext %col, i16 signext %k, <256 x i32> %c, <256 x i32> %a, <256 x i32> %b, <256 x i32>* %vptr) #0 {
				; CHECK-LABEL: @test_amx_dp(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_AMX:%.*]] = bitcast <256 x i32> [[A:%.*]] to x86_amx
				; CHECK-NEXT: [[B_AMX:%.*]] = bitcast <256 x i32> [[B:%.*]] to x86_amx
				; CHECK-NEXT: [[C_AMX:%.*]] = bitcast <256 x i32> [[C:%.*]] to x86_amx
				; CHECK-NEXT: br label [[ROWS_HEADER:%.*]]
				; CHECK: rows.header:
				; CHECK-NEXT: [[ROWS_IV:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[ROWS_STEP:%.*]], [[ROWS_LATCH:%.*]] ]
				; CHECK-NEXT: br label [[ROWS_BODY:%.*]]
				; CHECK: rows.body:
				; CHECK-NEXT: br label [[COLS_HEADER:%.*]]
				; CHECK: cols.header:
				; CHECK-NEXT: [[COLS_IV:%.*]] = phi i16 [ 0, [[ROWS_BODY]] ], [ [[COLS_STEP:%.*]], [[COLS_LATCH:%.*]] ]
				pengfei (Not Done): Maybe we can use zero mask load in future optimization.
				; CHECK-NEXT: br label [[COLS_BODY:%.*]]
				; CHECK: cols.body:
				; CHECK-NEXT: [[TMP0:%.*]] = udiv i16 [[K:%.*]], 4
				; CHECK-NEXT: br label [[INNER_HEADER:%.*]]
				; CHECK: inner.header:
				; CHECK-NEXT: [[INNER_IV:%.*]] = phi i16 [ 0, [[COLS_BODY]] ], [ [[INNER_STEP:%.*]], [[INNER_LATCH:%.*]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <256 x i32> [ [[C]], [[COLS_BODY]] ], [ [[TMP16:%.*]], [[INNER_LATCH]] ]
				; CHECK-NEXT: br label [[INNER_BODY:%.*]]
				; CHECK: inner.body:
				; CHECK-NEXT: [[TMP1:%.*]] = mul i16 [[ROWS_IV]], 16
				; CHECK-NEXT: [[TMP2:%.*]] = add i16 [[TMP1]], [[COLS_IV]]
				; CHECK-NEXT: [[TMP3:%.*]] = mul i16 [[ROWS_IV]], 16
				; CHECK-NEXT: [[TMP4:%.*]] = add i16 [[TMP3]], [[INNER_IV]]
				; CHECK-NEXT: [[TMP5:%.*]] = mul i16 [[INNER_IV]], 16
				; CHECK-NEXT: [[TMP6:%.*]] = add i16 [[TMP5]], [[COLS_IV]]
				; CHECK-NEXT: [[TMP7:%.*]] = extractelement <256 x i32> [[A]], i16 [[TMP2]]
				; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32 [[TMP7]] to <4 x i8>
				yubing (Author, Done): Sorry, there is a bug here. According to AMX's spec, dst's remaining part should be all zero.
				; CHECK-NEXT: [[TMP9:%.*]] = extractelement <256 x i32> [[A]], i16 [[TMP4]]
				; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32 [[TMP9]] to <4 x i8>
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <256 x i32> [[A]], i16 [[TMP6]]
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32 [[TMP11]] to <4 x i8>
				; CHECK-NEXT: [[TMP13:%.*]] = mul <4 x i8> [[TMP10]], [[TMP12]]
				; CHECK-NEXT: [[TMP14:%.*]] = add <4 x i8> [[TMP13]], [[TMP8]]
				; CHECK-NEXT: [[TMP15:%.*]] = bitcast <4 x i8> [[TMP14]] to i32
				; CHECK-NEXT: [[TMP16]] = insertelement <256 x i32> [[C]], i32 [[TMP15]], i16 [[TMP2]]
				; CHECK-NEXT: br label [[INNER_LATCH]]
				; CHECK: inner.latch:
				; CHECK-NEXT: [[INNER_STEP]] = add i16 [[INNER_IV]], 1
				; CHECK-NEXT: [[INNER_COND:%.*]] = icmp ne i16 [[INNER_STEP]], [[TMP0]]
				; CHECK-NEXT: br i1 [[INNER_COND]], label [[INNER_HEADER]], label [[COLS_LATCH]]
				; CHECK: cols.latch:
				; CHECK-NEXT: [[COLS_STEP]] = add i16 [[COLS_IV]], 1
				; CHECK-NEXT: [[COLS_COND:%.*]] = icmp ne i16 [[COLS_STEP]], [[COL:%.*]]
				; CHECK-NEXT: br i1 [[COLS_COND]], label [[COLS_HEADER]], label [[ROWS_LATCH]]
				; CHECK: rows.latch:
				; CHECK-NEXT: [[ROWS_STEP]] = add i16 [[ROWS_IV]], 1
				; CHECK-NEXT: [[ROWS_COND:%.*]] = icmp ne i16 [[ROWS_STEP]], [[ROW:%.*]]
				; CHECK-NEXT: br i1 [[ROWS_COND]], label [[ROWS_HEADER]], label [[CONTINUE:%.*]]
				; CHECK: continue:
				; CHECK-NEXT: store <256 x i32> [[TMP16]], <256 x i32>* [[VPTR:%.*]], align 64
				; CHECK-NEXT: ret void
				;
				entry:
				%a.amx = bitcast <256 x i32> %a to x86_amx
				%b.amx = bitcast <256 x i32> %b to x86_amx
				%c.amx = bitcast <256 x i32> %c to x86_amx
				%acc = call x86_amx @llvm.x86.tdpbssd.internal(i16 %row, i16 %col, i16 %k, x86_amx %c.amx, x86_amx %a.amx, x86_amx %b.amx)
				%vec = bitcast x86_amx %acc to <256 x i32>
				store <256 x i32> %vec, <256 x i32>* %vptr, align 64
				ret void
				}

				declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)
				declare x86_amx @llvm.x86.tdpbssd.internal(i16, i16, i16, x86_amx, x86_amx, x86_amx)
				declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)

				attributes #0 = { noinline nounwind optnone }

llvm/test/CodeGen/X86/O0-pipeline.ll

[12 lines not shown]
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Assumption Cache Tracker
 ; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Machine Branch Probability Analysis
 ; CHECK-NEXT: ModulePass Manager
 ; CHECK-NEXT: Pre-ISel Intrinsic Lowering
 ; CHECK-NEXT: FunctionPass Manager
 ; CHECK-NEXT: Expand Atomic instructions
-; CHECK-NEXT: Lower AMX type for load/store
+; CHECK-NEXT: Dominator Tree Construction
+; CHECK-NEXT: Natural Loop Information
+; CHECK-NEXT: Lower AMX intrinsics
 ; CHECK-NEXT: Module Verifier
 ; CHECK-NEXT: Lower Garbage Collection Instructions
 ; CHECK-NEXT: Shadow Stack GC Lowering
 ; CHECK-NEXT: Lower constant intrinsics
 ; CHECK-NEXT: Remove unreachable blocks from the CFG
 ; CHECK-NEXT: Instrument function entry/exit with calls to e.g. mcount() (post inlining)
 ; CHECK-NEXT: Scalarize Masked Memory Intrinsics
 ; CHECK-NEXT: Expand reduction intrinsics
[51 lines not shown]