Page MenuHomePhabricator

[X86] Pass to transform amx intrinsics to scalar operation.
ClosedPublic

Authored by yubing on Dec 20 2020, 4:58 AM.

Details

Summary

This pass runs in any situations but we skip it when it is not O0 and the
function doesn't have optnone attribute. With -O0, the def of shape to amx
intrinsics is near the amx intrinsics code. We are not able to find a
point which post-dominate all the shape and dominate all amx intrinsics.
To decouple the dependency of the shape, we transform amx intrinsics
to scalar operation, so that compiling doesn't fail. In long term, we
should improve fast register allocation to allocate amx register.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
yubing updated this revision to Diff 325964.Feb 23 2021, 7:18 PM

Fix some comments and commit message

yubing marked an inline comment as done.Feb 23 2021, 7:19 PM
yubing edited the summary of this revision. (Show Details)Feb 23 2021, 7:22 PM
pengfei added inline comments.Feb 24 2021, 5:03 AM
llvm/include/llvm/CodeGen/Passes.h
504

transforms

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
2

We usually comment as ===--- filename - description ---===
See head -n1 llvm/lib/Target/X86/*.cpp

52

Ctx

89

Can we just use template <bool IsLoad>? I think it also can reduce the branch.

100

Not sure how about the arithmetic intrinsics. But at least for load and store intrinsics we can use LLVM intrinsic llvm.masked.load/store to reduce the inner loop.

167

Maybe we can just use cast to help to raise the assertion.

224

You can use cast to help to check the failure so that VecA/B/C won't be uninitialized.

230

ditto

232

Should check it is V256I32?

233

ditto

289

eltc?

312

Is it necessary to insert the ResElt to VecC?

341

TileLoadStore

342

Forgot to remove?

344

ditto

388

ditto

392

ditto

llvm/lib/Target/X86/X86LowerAMXType.cpp
337

ditto

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
1 ↗(On Diff #325964)

Better name it amx-low-intrinsics-no-amx-bitcast.ll

13 ↗(On Diff #325964)

It seems the body block is not necessary

19 ↗(On Diff #325964)

ditto. The lable TILELOAD_SCALARIZE_COLS_BODY even not been used.

31 ↗(On Diff #325964)

I think cols.latch is not necessary either.

yubing added inline comments.Feb 24 2021, 6:50 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

Yes, it is necessary since you should use updated eltC(aka, Cij) when you are doing matrix dotproduct:
Cij =Cij+Ai1.*B1j
Cij =Cij+Ai2.*B2j
....
Cij =Cij+AiK.*BKj

pengfei added inline comments.Feb 24 2021, 7:37 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

But you don't need to update both C and D. Something like the psudo code should enough:

for (k : K)
  Dij += Aik * Bkj;
Dij += Cij
LuoYuanke added inline comments.Feb 27 2021, 4:49 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

Why do we need a template instead of passing a parameter bool IsLoad?

pengfei added inline comments.Feb 27 2021, 5:36 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

Bing thought template instantiation can avoid the condition code to turn into branch instructions.

LuoYuanke added inline comments.Feb 27 2021, 5:40 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
89

That may be arguable what benefit more. Code size saving or branch instructions avoiding. :)

yubing added inline comments.Feb 28 2021, 9:09 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
100

I think We can compose a follow-up patch for this optimization

yubing updated this revision to Diff 327362.Mar 1 2021, 11:13 PM
yubing edited the summary of this revision. (Show Details)

address comments above

yubing marked 15 inline comments as done.Mar 1 2021, 11:21 PM
yubing added inline comments.
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

I change code into the following style, and it can also reduce inner loop's size:

for (k : K)
  Cij += Aik * Bkj;
Dij = Cij

Besides, I hoist the procedure of calculating (i,j)'s linear index above inner loops.

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll
13 ↗(On Diff #325964)

In fact, ISEL PASS can merge basicblocks together.

LuoYuanke added inline comments.Mar 2 2021, 1:24 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

It seems keeping vector C unchanged is simpler. We can eliminate the phi, extract and insert instruction for vector C.

yubing added inline comments.Mar 2 2021, 7:30 PM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
312

But your solution still need to update D so D's phi will be kept in the inner loops.

pengfei accepted this revision.Mar 3 2021, 5:52 AM

LGTM with some nitpicks 😊

llvm/include/llvm/CodeGen/Passes.h
504

transforms

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
10

operations

11

We always enable it. Also need to mention optnone.

47

Curly brackets are not necessary here.

83

Do we need to remove the successor? Isn't it still being dominated?

97

IsTileLoad

118

Use 1 directly?

122

Better use the same naming conversion, i.e. ColLoopHeader

126–127

Better to change the order, e.g.

Type *EltTy = B.getInt32Ty();
FixedVectorType *V256I32Ty = FixedVectorType::get(EltTy, 256);
llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
2

I think we should move the files to llvm/test/Transforms/

llvm/test/CodeGen/X86/AMX/amx-type.ll
2

Why adding this? Is it O2 by default?

This revision is now accepted and ready to land.Mar 3 2021, 5:52 AM
LuoYuanke added inline comments.Mar 4 2021, 5:09 AM
llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
83

I think this is to remove edge from preheader to tmp, because we insert a loop between them.

llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-amx-bitcast.ll
2

Not sure about it. Our .cpp code is under lib/Target/X86/ folder.

llvm/test/CodeGen/X86/AMX/amx-type.ll
2

I think this is to test with opt level 2 this pass do nothing.

yubing updated this revision to Diff 328408.Mar 4 2021, 11:48 PM

Address pengfei's comments

This revision was landed with ongoing or failed builds.Mar 5 2021, 12:02 AM
This revision was automatically updated to reflect the committed changes.

This seems to break the build https://buildkite.com/mlir/mlir-core/builds/12026#91ec4dfe-542f-4312-92db-7d555f05ce06.

I could repro locally, reverting locally fixes the build.

Please address, thanks!

RKSimon added a subscriber: RKSimon.Mar 5 2021, 3:11 AM

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Thanks – I was just about to point out this broke downstream testing too.

Thanks all for reporting and reverting this.

yubing added a comment.Mar 6 2021, 5:13 AM

Thanks all for reporting and reverting this. I will do bugfix asap.

yubing added a comment.Mar 8 2021, 6:03 PM

@yubing I've reverted this as it was failing on a lot of buildbots: http://lab.llvm.org:8011/#/builders/109/builds/9867

Hi, @RKSimon @nicolasvasilache , it seems we haven't told libLLVMX86CodeGen.so.13git to link TransformUtils inllvm/lib/Target/X86/CMakeLists.txt, That's why we encounter buildfail.
But There is a strange thing which can be observed in build.ninja :
When I cmake with "-DBUILD_SHARED_LIBS=OFF", libLLVMX86CodeGen.a will still link lib/libLLVMTransformUtils.a.
When I cmake with "-DBUILD_SHARED_LIBS=ON", libLLVMX86CodeGen.so.13git won't link TransformUtils.
Is there any difference in build system for static library and shared library?

yubing reopened this revision.Mar 8 2021, 7:30 PM
This revision is now accepted and ready to land.Mar 8 2021, 7:30 PM
yubing updated this revision to Diff 329204.Mar 8 2021, 9:08 PM

Fix buildfail when it is -DBUILD_SHARED_LIBS=ON

This revision was landed with ongoing or failed builds.Mar 15 2021, 7:41 PM
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Mar 16 2021, 3:48 AM

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).

It looks like this has caused a compile-time regression at O0: https://llvm-compile-time-tracker.com/compare.php?from=9341bcbdc93a251b632ffaa51a84452a7a4a5e4e&to=4f198b0c27b04e830a3069aaf4b39cf203eaae4a&stat=instructions

The cause is probably the computation of DomTree and LoopInfo, even if no AMX intrinsics are present. I think you should be able to easily fix this by not fetching DT/LI from the pass manager, and computing them in the pass instead (only if intrinsics are present).

Thanks, @nikic, I will fix it ASAP. Besides, How could I reproduce the regression?
Eh, I am asking these question because I think I should see if the repression can't be reproduced with my future bugfix.

@yubing In this case I would recommend building sqlite3.c from test-suite under perf stat and look at the instructions metric. For me the command looks like this:

perf stat CLANG_BINARY   -w -Werror=date-time -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DSQLITE_OMIT_LOAD_EXTENSION=1 -DSQLITE_THREADSAFE=0 -I. -MD -MT MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o -MF MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o.d -o MultiSource/Applications/sqlite3/CMakeFiles/sqlite3.dir/sqlite3.c.o   -c ../MultiSource/Applications/sqlite3/sqlite3.c

You can generally get a build command using ninja -v sqlite3 in test-suite.

I can reproduce the regression. I'll help to fix it.