
[X86] Support AMX fast register allocation
ClosedPublic

Authored by xiangzhangllvm on Apr 7 2021, 1:59 AM.

Details

Summary

[X86] Support AMX fast register allocation
The AMX programming model was discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
The AMX fast register allocation was discussed on llvm-dev
(https://lists.llvm.org/pipermail/llvm-dev/2021-April/149770.html).
1>
At the O0 level, which for customers usually means clang -O0 -S/-c (front end and back end both compile at O0):
The tile data of an AMX intrinsic must be loaded from memory before it is used, and stored into memory after a tile register is defined.
Something like:

----------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, %t3, %t1, %t2)    // key AMX intrinsic
call void @llvm.x86.tilestored64.internal(..., %td)
----------------------------------------------------------------------

Because the live range of a tile register is very short (from tileload to tilestore, so it is impossible to spill), we let the fast register allocator allocate tile registers for them directly.

As the AMX programming model above shows, ldtilecfg must configure each tile register before it is used.
So we insert a ldtilecfg for every key AMX intrinsic, for two reasons:
1) we don't care much about performance at O0; 2) the shapes are very hard to compare at the O0 level.
e.g.

----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)
---------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, %t3, %t1, %t2)    // key AMX intrinsic
call void @llvm.x86.tilestored64.internal(..., %td)
-------------------------------------------------------------------------

But ldtilecfg needs the shapes of the tile registers written into its config memory, so we write the shapes before fast register allocation. (It is troublesome to do this after register allocation, because the shape registers allocated for the AMX intrinsics may no longer be live at the writing position.) At this point we do not yet know which physical tile register each virtual shape register will correspond to (register allocation has not run yet), so we simply write the shapes into config memory in order:
e.g.

----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4                          * allocate mem
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem         * zero init
...
// pre-config shape of %t1                                    *
store volatile i8 %m, i8* %amx.tmm.0.shape.row, align 1       *
store volatile i16 %k, i16* %amx.tmm.0.shape.col, align 2     * pre-config
// pre-config shape of %t2                                    * shapes
store volatile i8 %k, i8* %amx.tmm.1.shape.row, align 1       *
store volatile i16 %n, i16* %amx.tmm.1.shape.col, align 2     *
// pre-config shape of %t3, %td                                *
...
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)           * tile config
-------------------------------------------------------------------------

We then adjust them after fast register allocation.
e.g.
Suppose we wrote the first shape into %amx.tmm.0.shape.row (base + 48). If, after fast register allocation, we find that the first shape does not correspond to the first tile register (tmm0) but to the second tile register (tmm1), we adjust the write to go to %amx.tmm.1.shape.row (base + 48 + 1).

---------------------------------------------------------------------------
MOV8mi %stack.5, 1, $noreg, 49, $noreg, 8 :: (volatile store 1 into %ir.amx.tmm.0.shape.row)
MOV16mr %stack.5, 1, $noreg, 18, $noreg, renamable $cx :: (volatile store 2 into %ir.amx.tmm.0.shape.col)
     …
PLDTILECFGV killed renamable $rsi, 1, $noreg, 0, $noreg
--------------------------------------------------------------------------
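The rewrite arithmetic can be sketched as follows. This is a minimal sketch: the helper names row_offset and col_offset are illustrative, not functions in the patch, and the offsets follow Intel's documented ldtilecfg operand layout (palette id at byte 0, per-tile column byte-counts as u16 values at bytes 16..47, per-tile row counts as u8 values at bytes 48..63).

```python
# Byte offsets of the per-tile shape fields inside the 64-byte
# ldtilecfg config memory operand.
def row_offset(tmm: int) -> int:
    return 48 + tmm        # one u8 row count per tile register

def col_offset(tmm: int) -> int:
    return 16 + 2 * tmm    # one u16 column byte-count per tile register

# The shape was pre-written for tmm0 (offsets 48 and 16); once fast RA
# assigns tmm1 instead, the stores are rewritten to offsets 49 and 18,
# matching the MOV8mi/MOV16mr displacements in the MIR above.
print(row_offset(0), col_offset(0))  # 48 16
print(row_offset(1), col_offset(1))  # 49 18
```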

2>
Customers usually use clang -O0 -S/-c (front end and back end both compile at O0).
But LLVM developers may let the front end build with O1/O2/... and the back end build at O0 (e.g. clang -O0 -S -emit-llvm + llc -O0).

Since this is not the main way of building a program, and to let the algorithm above work for it too, I "volatile" the tile data of key AMX intrinsics in the pass "Lower AMX type for load/store", making it look just like clang -O0: all tile data of key AMX intrinsics must be loaded before use and stored into memory after a tile register is defined. Because the back end builds it at O0, we do not consider performance here, only correctness.

Diff Detail

Event Timeline

pengfei added inline comments.Apr 15 2021, 7:32 AM
llvm/lib/Target/X86/X86PreAMXConfig.cpp
206

You can check Loads[0] directly.

217

Why can't it be DPSSD etc.?

llvm/test/CodeGen/X86/AMX/amx-configO0toO0.ll
3

You don't need a prefix for the single RUN. Same below.

llvm/test/CodeGen/X86/AMX/amx-configO2toO0.ll
40

You should always set palette = 1 after it.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

But the alignments of the store and the alloca do not match. You may cause a runtime crash due to the alignment.

Address pengfei's comments:

llvm/lib/Target/X86/X86FastTileConfig.cpp
182

Checked at line 176, MO must be a tile physical register. Yes, the shapes of all of KeyMI's tile operands should be saved.
KeyMI can never be a tilestore, because in the volatile model the tile data of a tilestore must come from a tileload.
So KeyMI prefers a tileload. Line 154 in getKeyAMXInstr should never happen; I'll replace it with an assert.

193

Because we don't know the number of shapes at first, we may encounter the shapes out of order.

198

I think it is impossible at O0.

230

Yes, I missed it, thanks!

284

Do you mean materializeTileCfg(MI) here? It will modify the MBB, which would break the MBB iterators.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
56–58

I think that is not significant for a static optimization.

llvm/lib/Target/X86/X86LowerAMXType.cpp
424

The IsPHI flag passed by the caller makes sure it is a PHI.

428

It can't be nullptr; currently all tile defs should come from an IntrinsicInst. (PHIs are handled specially.)

546

I've started thinking about this case; I think it should never happen. Have we ever met a recursive PHI before? I think a "recursive PHI" would be a PHI that has more than 2 operands.

649–650

Greedy also needs this AMX lower type pass.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
206

What we care about is that there should be only 1 tileload for the tilestore.

llvm/test/CodeGen/X86/AMX/amx-configO0toO0.ll
3

You mean --check-prefix=AMX_O0? I just thought it is clearer for this test.

xiangzhangllvm added inline comments.Apr 15 2021, 5:44 PM
llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Make sense, thanks!

xiangzhangllvm marked 12 inline comments as done.Apr 16 2021, 1:11 AM

The patch is big; thanks to Pengfei for the review!

llvm/lib/Target/X86/X86PreAMXConfig.cpp
217

Here "be stored" means "be stored into mem", not "be a TileStore instruction". I'll refine this comment.

357

It is checked at its caller.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Updated in the other tests, e.g. llvm/test/CodeGen/X86/AMX/amx-configO2toO0-precfg.ll, line 30.

Refine Shape info:
In AMX intrinsics we let Shape = {Row, Col (bytes)}, but RealCol = Col / ElementSize. We may use RealCol as a new Row for other newly created AMX intrinsics.
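The shape arithmetic can be illustrated with a small sketch (the helper name real_col is illustrative, not from the patch): Col counts bytes per row, so the element count per row depends on the element size.

```python
# Shape = {Row, Col (bytes)}; the element count per row is
# RealCol = Col / ElementSize, which may feed a newly created
# AMX intrinsic as its Row.
def real_col(col_bytes: int, elem_size: int) -> int:
    assert col_bytes % elem_size == 0
    return col_bytes // elem_size

print(real_col(64, 4))  # 16 int32 elements in a 64-byte row
print(real_col(64, 1))  # 64 int8 elements in a 64-byte row
```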

LuoYuanke added inline comments.Apr 19 2021, 8:13 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
91

Do we need to calculate the value in compile time if V is constant?

llvm/lib/Target/X86/X86PreAMXConfig.cpp
70

I don't understand the function. Does it mean only the tilezero intrinsic returns true?

xiangzhangllvm added inline comments.Apr 19 2021, 8:18 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
91

CreateUDiv already handles it.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
70

And currently tileload too; this function returns whether the intrinsic only defines a tile (and does not use one).

pengfei added inline comments.Apr 19 2021, 9:04 PM
llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
1042 ↗(On Diff #338692)

Is this function actually used?

llvm/lib/Target/X86/X86LowerAMXType.cpp
68

Better to move it into class X86LowerAMXType.

383–385

You can make it a member of X86VolatileTileData. Then you don't need to calculate F here.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
206–208

return preWriteTileCfg(I8Ptr, Cfg, Shapes); ?

325

Does it cause a warning in release builds?

I'll try building it in release and make sure there are no warnings, thanks!

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
1042 ↗(On Diff #338692)

I think I mixed it up with the identically named function "bool addPostFastRegAllocRewrite()", thank you!

llvm/lib/Target/X86/X86LowerAMXType.cpp
68

I thought about it before; here I just didn't want to pass it into the static function (getRowFromCol).

383–385

This is in a static function, not a member function of X86VolatileTileData.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
206–208

Let's return Init0; that covers the assert(Init0 && ...), thanks!

325

I think yes, good catch, thanks!

xiangzhangllvm marked 2 inline comments as done.Apr 20 2021, 12:44 AM

Refined: no new warnings in the release build.

llvm/lib/Target/X86/X86FastTileConfig.cpp
298

Should this return true based on whether the tilecfg was rewritten?

llvm/lib/Target/X86/X86LowerAMXType.cpp
426

Coding style: Use {} when if uses it.

648–650

Remove unused code.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
268

Maybe better to use BasicBlock::iterator &, then you don't need to return it.

282

Coding style: Use {} when the else uses it.

372–374

Coding style: Don't use {} for single line.

pengfei added inline comments.Apr 20 2021, 1:42 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
148

Can we always get the shape from the def of an AMX value? Then we can avoid adding a calculation for it.
We can add an assert here to make sure of it.

xiangzhangllvm marked an inline comment as done.Apr 20 2021, 1:49 AM
xiangzhangllvm added inline comments.
llvm/lib/Target/X86/X86LowerAMXType.cpp
148

We can only pass the element size into the AMX intrinsic def. The calculation is still needed.

xiangzhangllvm marked 5 inline comments as done.Apr 20 2021, 7:09 PM
xiangzhangllvm added inline comments.
llvm/lib/Target/X86/X86PreAMXConfig.cpp
268

Letting the caller do I = getShapesAndConfigPosEnd(...) is more readable, in my eyes.

pengfei accepted this revision.Apr 20 2021, 7:48 PM

The implementation looks good to me in general. Let's wait a few days for opinions from the community and other reviewers.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Nit: the source lacks palette initialization. It's up to you whether to add it.

This revision is now accepted and ready to land.Apr 20 2021, 7:48 PM

Thanks for your review!

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Yes, it doesn't matter here; this test focuses on checking the rewriting of shapes after fast register allocation.
So I didn't write the palette in this test.
In the other tests (amx-configO2toO0.ll, amx-configO0toO0.ll) we can see the palette is set to 1.

Refined with clang-format.
And added a FIXME at LowerAMXType.

xiangzhangllvm added inline comments.Apr 24 2021, 6:41 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
147

And added a FIXME at LowerAMXType.

This revision was landed with ongoing or failed builds.Apr 24 2021, 6:48 PM
This revision was automatically updated to reflect the committed changes.
bkramer added inline comments.
llvm/lib/Target/X86/X86LowerAMXType.cpp
68

You can't simply have global state here; it doesn't work in a multithreaded environment. I reverted this change in df323ba445f7fc4d29def8950e80dec6ba487961 because it breaks us.

Fixed bkramer's multithreading problem.

xiangzhangllvm marked an inline comment as done.May 7 2021, 1:07 AM

That's better, thanks!

Refined: compatible with the old AMX API.