Page MenuHomePhabricator

[X86] Pass to transform amx intrinsics to scalar operation.
Needs ReviewPublic

Authored by yubing on Dec 20 2020, 4:58 AM.

Details

Summary

This pass runs in any situations but we skip it when it is not O0 and the
function doesn't have optnone attribute. With -O0, the def of shape to amx
intrinsics is near the amx intrinsics code. We are not able to find a
point which post-dominate all the shape and dominate all amx intrinsics.
To decouple the dependency of the shape, we transform amx intrinsics
to scalar operation, so that compiling doesn't fail. In long term, we
should improve fast register allocation to allocate amx register.

Diff Detail

Unit TestsFailed

TimeTest
70 msx64 debian > LLVM.CodeGen/X86/AMX::amx-low-intrinsics-no-amx-bitcast.ll
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -mtriple=x86_64 -lower-amx-intrinsics /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll -S | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --allow-unused-prefixes=false /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
70 msx64 debian > LLVM.CodeGen/X86/AMX::amx-low-intrinsics.ll
Script: -- : 'RUN: at line 2'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -mtriple=x86_64 -lower-amx-intrinsics /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll -S | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck --allow-unused-prefixes=false /mnt/disks/ssd0/agent/llvm-project/llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
330 msx64 debian > libarcher.races::task-dependency.c
Script: -- : 'RUN: at line 13'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests -I /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/llvm-project/build/lib -Wl,-rpath,/mnt/disks/ssd0/agent/llvm-project/build/lib /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-dependency.c -o /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/deflake.bash /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp 2>&1 | tee /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-dependency.c.tmp.log | /mnt/disks/ssd0/agent/llvm-project/build/./bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-dependency.c
340 msx64 debian > libarcher.races::task-taskgroup-unrelated.c
Script: -- : 'RUN: at line 13'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests -I /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/llvm-project/build/lib -Wl,-rpath,/mnt/disks/ssd0/agent/llvm-project/build/lib /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-taskgroup-unrelated.c -o /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskgroup-unrelated.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/deflake.bash /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskgroup-unrelated.c.tmp 2>&1 | tee /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskgroup-unrelated.c.tmp.log | /mnt/disks/ssd0/agent/llvm-project/build/./bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-taskgroup-unrelated.c
350 msx64 debian > libarcher.races::task-taskwait-nested.c
Script: -- : 'RUN: at line 13'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -g -O1 -fsanitize=thread -I /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests -I /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/llvm-project/build/lib -Wl,-rpath,/mnt/disks/ssd0/agent/llvm-project/build/lib /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-taskwait-nested.c -o /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskwait-nested.c.tmp -latomic && env TSAN_OPTIONS='ignore_noninstrumented_modules=0:ignore_noninstrumented_modules=1' /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/deflake.bash /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskwait-nested.c.tmp 2>&1 | tee /mnt/disks/ssd0/agent/llvm-project/build/projects/openmp/tools/archer/tests/races/Output/task-taskwait-nested.c.tmp.log | /mnt/disks/ssd0/agent/llvm-project/build/./bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/openmp/tools/archer/tests/races/task-taskwait-nested.c
View Full Test Results (15 Failed)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
yubing updated this revision to Diff 319797.Jan 28 2021, 2:15 AM

Fix some bugs in lowerTileDPBSSD, lowerTileStore, lowerTileLoad

Herald added a project: Restricted Project. · View Herald TranscriptJan 28 2021, 2:15 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Would you rebase to see if the lit test failure is related to this patch?

yubing updated this revision to Diff 321673.Fri, Feb 5, 1:04 AM

Rebase and fix the bug in amx_api.c

yubing added a comment.Fri, Feb 5, 2:11 AM

Strange. llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll can pass in my local machine.

yubing added inline comments.Mon, Feb 8, 7:53 PM
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
78

Sorry, there is a bug here. According to AMX's spec, dst's remaining part should be all zero.

xiangzhangllvm added inline comments.Tue, Feb 9, 1:12 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
357

I see you need force match bitcast then replace, add assert for no bitcast case

472

'|' is bits or, use logic ||

pengfei added inline comments.Tue, Feb 9, 1:45 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
434

bool C = false

440

We can use a forward order to iterate it.
Besides, we cannot assume there always be bitcast after e.g. x86_tileloadd64_internal. So we need to insert one bitcast as required.

471

Remove the {} for single line loop.

507

You can just return it by return LAT.visit().

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics.ll
61

Maybe we can use zero mask load in future optimization.

LuoYuanke added inline comments.Tue, Feb 9, 4:24 AM
llvm/lib/Target/X86/X86TargetMachine.cpp
420

We may add both pass anyway and skip the pass based on the option level and option attribute in the two passes.

xiangzhangllvm added inline comments.Tue, Feb 9, 4:45 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213

In fact, no need handle Row, Col, K here, just use fix size 16x16, the result of calculation is some in effective area. (just need tileload "keep" the "unused" area is 0).
Then can use vector to handle all of the them, let type legalization to split the type.

yubing added inline comments.Fri, Feb 19, 9:46 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
212–213

We should keep the code here. In bf16, since +0.0(0x0000) * negative float is equal to -0.0(0x8000), following your solution is not able to ensure outer edge is allzero.

yubing updated this revision to Diff 325152.Fri, Feb 19, 9:48 PM

Address the commments above.

yubing updated this revision to Diff 325155.Fri, Feb 19, 9:59 PM
yubing marked 5 inline comments as done.

Small fix for some code

yubing marked an inline comment as done.Fri, Feb 19, 10:01 PM
LuoYuanke added inline comments.Sat, Feb 20, 12:57 AM
llvm/include/llvm/CodeGen/Passes.h
496

Add comments to describe what the pass does?

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2

This seems wrong file name.

12

Type 'able'.

160

Not sure if we can extract the common code of createTileLoadLoops and createTileStoreLoops, so that it can be used by both and some other functions.

251

Delete the dead code.

268

It should be in another line.

281

Better to be in a new line.

293

Better to be in a new line.

373

The name seems not good. Is "PreBuilder" better? And why we need two builder in the function?

378

Maybe use right shift instruction which is more efficient. Don't the following pass can optimize the operation.

412

Is "PreBuilder" better?

416

Shift?

449

PreBuilder?

488

Do we iterate the instructions in topology order or in post order?

pengfei added inline comments.Sun, Feb 21, 7:22 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
491

Should be better to use

if (auto *Inst  = dyn_cast<IntrinsicInst>&*II++)
  switch (Inst->getIntrinsicID()) {
  case Intrinsic::x86_tdpbssd_internal:
  ...
502

ditto

yubing updated this revision to Diff 325670.Mon, Feb 22, 9:50 PM

Address comments above

yubing marked 13 inline comments as done.Mon, Feb 22, 9:54 PM
yubing added inline comments.
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
488

It should be pre-order since we need to handle cases without bitcasts, such as, amx-low-intrinsics-no-bitcast.ll

yubing updated this revision to Diff 325964.Tue, Feb 23, 7:18 PM

Fix some comments and commit message

yubing marked an inline comment as done.Tue, Feb 23, 7:19 PM
yubing edited the summary of this revision. (Show Details)Tue, Feb 23, 7:22 PM
pengfei added inline comments.Wed, Feb 24, 5:03 AM
llvm/include/llvm/CodeGen/Passes.h
496

transforms

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2

We usually comment as ===--- filename - description ---===
See head -n1 llvm/lib/Target/X86/*.cpp

52

Ctx

89

Can we just use template <bool IsLoad>? I think it also can reduce the branch.

100

Not sure how about the arithmetic intrinsics. But at least for load and store intrinsics we can use LLVM intrinsic llvm.masked.load/store to reduce the inner loop.

167

Maybe we can just use cast to help to raise the assertion.

224

You can use cast to help to check the failure so that VecA/B/C won't be uninitialized.

230

ditto

232

Should check it is V256I32?

233

ditto

289

eltc?

312

Is it necessary to insert the ResElt to VecC?

341

TileLoadStore

342

Forgot to remove?

344

ditto

388

ditto

392

ditto

llvm/lib/Target/X86/X86LowerAMXType.cpp
333

ditto

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
1 ↗(On Diff #325964)

Better name it amx-low-intrinsics-no-amx-bitcast.ll

13 ↗(On Diff #325964)

It seems the body block is not necessary

19 ↗(On Diff #325964)

ditto. The lable TILELOAD_SCALARIZE_COLS_BODY even not been used.

31 ↗(On Diff #325964)

I think cols.latch is not necessary either.

yubing added inline comments.Wed, Feb 24, 6:50 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

Yes, it is necessary since you should use updated eltC(aka, Cij) when you are doing matrix dotproduct:
Cij =Cij+Ai1.*B1j
Cij =Cij+Ai2.*B2j
....
Cij =Cij+AiK.*BKj

pengfei added inline comments.Wed, Feb 24, 7:37 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

But you don't need to update both C and D. Something like the psudo code should enough:

for (k : K)
  Dij += Aik * Bkj;
Dij += Cij
LuoYuanke added inline comments.Sat, Feb 27, 4:49 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

Why do we need a template instead of passing a parameter bool IsLoad?

pengfei added inline comments.Sat, Feb 27, 5:36 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

Bing thought template instantiation can avoid the condition code to turn into branch instructions.

LuoYuanke added inline comments.Sat, Feb 27, 5:40 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

That may be arguable what benefit more. Code size saving or branch instructions avoiding. :)

yubing added inline comments.Sun, Feb 28, 9:09 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
100

I think We can compose a follow-up patch for this optimization

yubing updated this revision to Diff 327362.Mon, Mar 1, 11:13 PM
yubing edited the summary of this revision. (Show Details)

address comments above

yubing marked 15 inline comments as done.Mon, Mar 1, 11:21 PM
yubing added inline comments.
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

I change code into the following style, and it can also reduce inner loop's size:

for (k : K)
  Cij += Aik * Bkj;
Dij = Cij

Besides, I hoist the procedure of calculating (i,j)'s linear index above inner loops.

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
13 ↗(On Diff #325964)

In fact, ISEL PASS can merge basicblocks together.

LuoYuanke added inline comments.Tue, Mar 2, 1:24 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instruction for vector C.