This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] Multiple configure for AMX register.
ClosedPublic

Authored by LuoYuanke on May 6 2022, 12:10 AM.

Details

Summary

The previous solution depends on variable name to record the shape
information. However it is not reliable, because in release build
compiler would not set the variable name. It can be accomplished with an
additional option fno-discard-value-names, but it is not acceptable
for users.
This patch is to preconfigure the tile register with machine
instruction. It follow the same way what sigle configure does. In the
future we can fall back to multiple configure when single configure
fails due to the shape dependency issue.
The algorithm to configure the tile register is simple in the patch. We
may improve it in the future. It configure tile register based on basic
block. Compiler would spill the tile register if it live out the basic
block. After the configure there should be no spill across tile
confgiure in the register alloction. Just like fast register allocation
the algorithm walk the instruction in reverse order. When the shape
dependency doesn't meet, it insert ldtilecfg after the last instruction
that define the shape.
In post configuration compiler also walk the basic block to collect the
physical tile register number and generate instruction to fill the stack
slot for the correponding shape information.

Diff Detail

Event Timeline

LuoYuanke created this revision.May 6 2022, 12:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 12:10 AM
LuoYuanke requested review of this revision.May 6 2022, 12:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 12:10 AM
LuoYuanke updated this revision to Diff 427922.May 8 2022, 2:59 AM

Add more test cases and fix some bugs.

xiangzhangllvm added inline comments.May 9 2022, 1:22 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
440

Maybe better to add the comments with change steps of a simple case ahead the key functions.
(Some like the old AMX fast allocation do) That will make it easy for readers who not very familiar here code.

pengfei added inline comments.May 9 2022, 1:27 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
393–394

We need reload after function call too.

LuoYuanke updated this revision to Diff 429425.May 14 2022, 2:28 AM

Fix bugs related to phi.

LuoYuanke updated this revision to Diff 429458.May 14 2022, 7:44 AM

Fix lit failure.

LuoYuanke updated this revision to Diff 430008.May 17 2022, 5:37 AM

Hanlde circular dependency of phi node.

LuoYuanke updated this revision to Diff 430345.May 18 2022, 5:32 AM
  1. Fix stack slot address of circular phi.
  2. Handle config across function call.
  3. Address Phoebe and Xiang's comments.
LuoYuanke updated this revision to Diff 430372.May 18 2022, 7:16 AM

Fix bug of config across call.

LuoYuanke updated this revision to Diff 430380.May 18 2022, 7:43 AM

Fix config bug across call.

xiangzhangllvm added inline comments.May 20 2022, 2:59 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
47

Forget to remove ?

408

This reverse the PHI order. Seems dangerous.
There may be

p0 = PHI A, B
p1 = PHI p0, C

A B  C
 \  |  /
    D

But I didn't successful created such test case

LuoYuanke added inline comments.May 20 2022, 7:20 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
408

Good catch. I'll fix it.

LuoYuanke added inline comments.May 20 2022, 7:30 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
408

I'm not sure it is well cononical phi node, but I created such case.

xiangzhangllvm added inline comments.May 20 2022, 7:43 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
356

The most important change for "kill/dead" , I think, is not here. It should be the "old" last use (if it before current PHI)

357–358

It is possible to extend the row/col reg kill to here.

LuoYuanke added inline comments.May 21 2022, 4:38 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
356

Yes, but there should be no side effect to set false. The kill flag should be caluculated during register alloction.

357–358

The kill flag should be caluculated during register alloction. We need to anaylze the live range to know the kill flag. Here I think it is not important to set kill flag.

LuoYuanke updated this revision to Diff 431129.May 21 2022, 4:41 AM

Address Xiang's comments.

  1. Canonicalize the phi node.
  2. Trace the liveout register as we may miss spill when phi is transformed to tileload and the phi is deleted.
xiangzhangllvm added inline comments.May 23 2022, 2:10 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
551

I am afraid , even without call, 1 ldtilecfg is not enough for 1 MBB.
For example
In 1 MBB, it contain 4 shapes, but the first 3 shape used up the "max reg num of ldtilecfg (8)" virtual tile regs in follow code. So It need another ldtilecfg.

565

Not much sure here must no "COPY" for tile. (T1 = Copy T0)
Better to add assert here to exclude unexpected Tile def (MI is not intrinsic).

575

Sorry, In my understand, the comment is not match the code.
it should be

 //   tilezero
 //   def row
 //   def col  <- LastShapeMI 
//   ldtilecfg <- insert
 //   tilezero(row, col)
605

We should escape duplicated reload. for example:

Not re-gen reload for line 2.

1  T0 = TileLoad
2   TileUse T0
LuoYuanke added inline comments.May 23 2022, 5:08 AM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
551

That would run out of register. Currently we have valotiled tile in "lower amx type" pass. We can improve it later and disable volatile tile in that pass.

565

Thre are only phi or copy beside AMX intrinsic. We handle phi separately. "COPY" may be generated after phi elimination. OK, let me add assert here.

575

Let me update the comments.

605

How about optimize reload in another patch?

LuoYuanke updated this revision to Diff 431346.May 23 2022, 5:44 AM

Address Xiang's comments.

xiangzhangllvm added inline comments.May 23 2022, 5:46 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
551

If we consider greedy-->fast, we should consider this problem.

576

Copy will cause the wrong action for following get shape.
MI.getOperand(x)

605

No problem.

xiangzhangllvm added inline comments.May 23 2022, 6:15 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
551

And do we need to consider shape number > "max reg num of ldtilecfg (8)" in current stage ? It is possible in big BB.

LuoYuanke added inline comments.May 23 2022, 6:36 PM
llvm/lib/Target/X86/X86FastPreTileConfig.cpp
551

That is not supported yet. We can improve it later.

576

Copy is filter out by isTileDef?

xiangzhangllvm added inline comments.May 23 2022, 6:47 PM
llvm/lib/Target/X86/X86TargetMachine.cpp
424

Remove.

LuoYuanke updated this revision to Diff 431561.May 23 2022, 7:37 PM

Address Xiang's comments.

xiangzhangllvm accepted this revision.May 23 2022, 7:41 PM

Mark some TODO for your planning.

llvm/test/CodeGen/X86/AMX/amx-across-func.ll
619

We may need reconfig if tilerelease

This revision is now accepted and ready to land.May 23 2022, 7:41 PM
LuoYuanke added inline comments.May 23 2022, 7:48 PM
llvm/test/CodeGen/X86/AMX/amx-across-func.ll
619

Config is based on function unit. Each function that use AMX should reconfig. If it doesn't touch AMX, no config is required.

LuoYuanke added a comment.EditedMay 23 2022, 10:10 PM

Mark some TODO for your planning.

OK, let me record TODO in the commit message.

This revision was landed with ongoing or failed builds.May 23 2022, 10:44 PM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.May 24 2022, 1:20 AM

FYI this has some noticeable compile-time impact on O0 builds: http://llvm-compile-time-tracker.com/compare.php?from=62a9b36fcf728b104ea87e6eb84c0be69b779df7&to=496156ac57da3abd9c8a6dc422852b7bdfaa448f&stat=instructions

Maybe it's possible to skip more of X86FastPreTileConfig early if tile registers are not used (i.e. almost always)?

FYI this has some noticeable compile-time impact on O0 builds: http://llvm-compile-time-tracker.com/compare.php?from=62a9b36fcf728b104ea87e6eb84c0be69b779df7&to=496156ac57da3abd9c8a6dc422852b7bdfaa448f&stat=instructions

Maybe it's possible to skip more of X86FastPreTileConfig early if tile registers are not used (i.e. almost always)?

Yes, that's a promising fix. Let me come up with a patch for it. Thanks.

FYI this has some noticeable compile-time impact on O0 builds: http://llvm-compile-time-tracker.com/compare.php?from=62a9b36fcf728b104ea87e6eb84c0be69b779df7&to=496156ac57da3abd9c8a6dc422852b7bdfaa448f&stat=instructions

Maybe it's possible to skip more of X86FastPreTileConfig early if tile registers are not used (i.e. almost always)?

Yes, that's a promising fix. Let me come up with a patch for it. Thanks.

Patch https://reviews.llvm.org/D126280 to reduce the compiling time.

LuoYuanke added a comment.EditedJun 27 2022, 8:07 PM

I can duplicated the regression with "llc -mtriple=aarch64 test/DebugInfo/Generic/two-cus-from-same-file.ll -O0 -o -". This is a ISel bug on "DBG_VALUE", I'll file a bug for it.

This comment was removed by LuoYuanke.