This is an archive of the discontinued LLVM Phabricator instance.

[X86] Support AMX fast register allocation
ClosedPublic

Authored by xiangzhangllvm on Apr 7 2021, 1:59 AM.

Details

Summary

[X86] Support AMX fast register allocation
The AMX programming model was discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
The AMX fast register allocation was discussed on llvm-dev
(https://lists.llvm.org/pipermail/llvm-dev/2021-April/149770.html).
1>
At O0, customers usually compile with clang -O0 -S/-c (both the front end and the back end run at O0):
The tile data of an AMX intrinsic must be loaded from memory before each use, and stored back to memory after each tile register definition.
Something like

----------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)    // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
----------------------------------------------------------------------

Because the live range of a tile register is very short (from tileload to tilestore, impossible to spill), we let the fast register allocator allocate tile registers for them directly.

As the AMX programming model above shows, we need an ldtilecfg to configure each tile register before it is used.
So we insert an ldtilecfg for every key AMX intrinsic. (There are two reasons to do it this way:
1) we don't care much about performance at O0; 2) the shapes are very hard to compare at O0.)
e.g.
e.g.

----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)
---------------------------------------------------------------------
%t1 = call x86_amx @llvm.x86.tileloadd64.internal(m, k, ...)
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(k, n, ...)
%t3 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, ...)
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)    // key amx intrinsic
call void @llvm.x86.tilestored64.internal(... td)
-------------------------------------------------------------------------

But ldtilecfg needs the shapes of the tile registers written into its config memory, so we write the shapes before fast register allocation. (It is troublesome to do this after register allocation, because the registers allocated for the shapes of the AMX intrinsics may no longer be live at the writing position.) At this point we don't yet know which physical tile register each virtual shape register will be written for, because register allocation hasn't run yet. So we simply write the shapes into the config memory in order:
e.g.

----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4                                 * allocate mem
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem       * zero init
...
//pre-config shape of %t1                                                     *
store volatile i8 %m, i8* %amx.tmm.0.shape.row, align 1     *
store volatile i16 %k, i16* %amx.tmm.0.shape.col, align 2    * pre-config
// pre-config shape of %t2                                                    * shapes
store volatile i8 %k, i8* %amx.tmm.1.shape.row, align 1       *
store volatile i16 %n, i16* %amx.tmm.1.shape.col, align 2    *
// pre-config shape of %t3, %td                                            *
            ...
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)              * tile config
-------------------------------------------------------------------------

Then we adjust the writes after fast register allocation.
e.g.
Suppose we wrote the first shape into %amx.tmm.0.shape.row (base + 48). If, after fast register allocation, we find that the first shape does not correspond to the first tile register (tmm0) but to the second tile register (tmm1), we adjust the written memory to %amx.tmm.1.shape.row (base + 48 + 1).

---------------------------------------------------------------------------
MOV8mi %stack.5, 1, $noreg, 49, $noreg, 8 :: (volatile store 1 into %ir.amx.tmm.0.shape.row)
MOV16mr %stack.5, 1, $noreg, 18, $noreg, renamable $cx :: (volatile store 2 into %ir.amx.tmm.0.shape.col)
     …
PLDTILECFGV killed renamable $rsi, 1, $noreg, 0, $noreg
--------------------------------------------------------------------------
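
For reference, a sketch of the 64-byte tile config memory layout (following the documented AMX tile configuration format; this is background, not part of the patch, and explains the offsets 49 and 18 above):

---------------------------------------------------------------------------
byte  0           palette
byte  1           start_row
bytes 16..47      colsb[0..15], 2 bytes each    // tmmN.col at base + 16 + 2*N
bytes 48..63      rows[0..15],  1 byte each     // tmmN.row at base + 48 + N
// e.g. for tmm1: row at 48 + 1 = 49, col at 16 + 2*1 = 18, as in the MIR above
---------------------------------------------------------------------------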

2>
Customers usually compile with clang -O0 -S/-c (both the front end and the back end run at O0).
But LLVM developers may let the front end build with O1/O2/... and the back end build at O0 (e.g. clang -O0 -S -emit-llvm + llc -O0).

Since this is not the main way of building a program, and to let the algorithm above work for it too, I "volatilize" the tile data of the key AMX intrinsics in the pass "Lower AMX type for load/store", making the IR look like clang -O0 output: all tile data of the key AMX intrinsics must be loaded before each use and stored back to memory after each tile register definition. Because the back end builds at O0, we don't consider performance here, only correctness.
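
For illustration, a minimal before/after sketch of this volatilization, in the same loose pseudo-IR style as the examples above (the names %td.mem and %td2 are made up here):

----------------------------------------------------------------------
// Before: the tile def %td is used directly.
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)
call void @llvm.x86.tilestored64.internal(... td)             // user's store

// After: the def is stored to a stack slot and reloaded before each use.
%td.mem = alloca <256 x i32>, align 64                        // stack slot for %td
%td = call x86_amx @llvm.x86.tdpbssd.internal(m, n, k, t1, t2, t3)
call void @llvm.x86.tilestored64.internal(m, n, %td.mem, ... td)
%td2 = call x86_amx @llvm.x86.tileloadd64.internal(m, n, %td.mem, ...)
call void @llvm.x86.tilestored64.internal(... td2)            // user's store, now of the reload
----------------------------------------------------------------------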

Diff Detail

Event Timeline

xiangzhangllvm created this revision.Apr 7 2021, 1:59 AM
xiangzhangllvm requested review of this revision.Apr 7 2021, 1:59 AM
Herald added a project: Restricted Project.Apr 7 2021, 1:59 AM
xiangzhangllvm edited the summary of this revision.Apr 7 2021, 2:00 AM
xiangzhangllvm edited the summary of this revision.Apr 12 2021, 11:05 PM
pengfei added inline comments.Apr 13 2021, 11:08 PM
llvm/lib/Target/X86/X86ExpandPseudo.cpp
483 ↗(On Diff #335756)

Maybe we don't need a pseudo instruction?

llvm/lib/Target/X86/X86FastTileConfig.cpp
27

You can remove it if you don't use it.

29

Same as above.

38

Same as above.

53–56

You can remove them if you don't use them.

64

Naming conventions: function names should start with a lowercase letter.

112–115

You can remove it if you don't require any analysis.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4705 ↗(On Diff #335756)

You can move it to .td file.

llvm/lib/Target/X86/X86LowerAMXType.cpp
543

You may need to exclude debug intrinsics.

551

It may not be the last phi.

618

Do you also insert a store for load intrinsic?

638

We may need to postpone it until after we find an AMX intrinsic.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
61

You may need to exclude debug intrinsics.

174

Should be V512Ty?

194

Better to add an assert for this case.

292

Better to add an assert to check that I is a load.

357

Do we need to check it since the pass is only created under O0?

llvm/test/CodeGen/X86/AMX/amx-configO2toO0.ll
3

Better to change one case to use AVX or SSE to check whether the stack is cleared correctly.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
417

The shapes for tmm0 are ax and cx, but the shape stored on the stack is $sil and 8?

442

I don't find where we store this shape.

llvm/utils/gn/secondary/llvm/lib/Target/X86/BUILD.gn
114

Why not add X86FastTileConfig?

Address Pengfei's comments:

llvm/lib/Target/X86/X86ExpandPseudo.cpp
483 ↗(On Diff #335756)

Yes, I'll change it (due to some history, I used a pseudo instruction).

llvm/lib/Target/X86/X86LowerAMXType.cpp
543

PHI's operands shouldn't be debug intrinsics.

551

No domination relation is needed here; I'll remove it. (For historical reasons, I put the allocation instruction in a dominated BB.)

618

If the tileload comes from the user's source code, O0 or my patch will generate a store for it.
We shouldn't generate a tilestore for an auto-generated tileload.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
61

debug IRs are not IntrinsicInst.

174

We set the config mem alignment to 4 bytes earlier, so this stays in sync with it.

292

BB.end() is possible.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
417

This is a mistake; I'll fix it, thanks!

442

The MOV8/16* instructions above are pre-generated before register allocation; the pass under test adjusts their store positions.

Just one thought: is there a case where the user specifies --regalloc=fast together with O2 in llc? Can we handle that case?

Thanks
Pengfei

We don't cover that case currently; O0 + greedy is not supported either.

xiangzhangllvm marked 7 inline comments as done.Apr 15 2021, 4:40 AM
xiangzhangllvm added inline comments.
llvm/lib/Target/X86/X86PreAMXConfig.cpp
194

Multiple tilestores of one tile definition don't fit the current volatile model; that would fail in the assert.

llvm/test/CodeGen/X86/AMX/amx-configO2toO0.ll
3

I'll add it, thanks!

pengfei added inline comments.Apr 15 2021, 7:32 AM
llvm/lib/CodeGen/TargetPassConfig.cpp
1321

Will this affect other targets? I think being called by FastRA might not be expected for them.

llvm/lib/Target/X86/X86FastTileConfig.cpp
45–46

A bit strange to leave them uninitialized.

132

You should exclude debug MI here.

135

Extra spaces.

182

Do you need to check MO's register class? If the KeyMI is a store, will you save its tile as a shape?

193

If the numbers are consecutive, using a SmallVector should have better performance.

198

Is it possible that the shapes are not in the current BB? E.g., the previous BB was split, etc.

230

Should palette = 1?

273

Nit: maybe better to define it to return void and check ShapedTiles.size() in an assert.

284

I think it's better to collect the shape config here.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
56–58

You can move it into the namespace. By the way, clang-format.

llvm/lib/Target/X86/X86LowerAMXType.cpp
20

Is this necessary?

411

Better to use cast to help check for failures.

415

Will II be nullptr?

533

Can we handle a phi used by another phi?

639

Is this necessary for O0?

pengfei added inline comments.Apr 15 2021, 7:32 AM
llvm/lib/Target/X86/X86PreAMXConfig.cpp
206

You can check Loads[0] directly.

217

Why can't it be DPSSD etc.?

llvm/test/CodeGen/X86/AMX/amx-configO0toO0.ll
3

You don't need a prefix for the single RUN line. The same below.

llvm/test/CodeGen/X86/AMX/amx-configO2toO0.ll
40

You should always set palette = 1 after it.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

But the alignments of the store and the alloca don't match. That may cause a runtime crash due to the misalignment.

Address pengfei's comments:

llvm/lib/Target/X86/X86FastTileConfig.cpp
182

Checked at line 176: MO must be a tile physical register. Yes, the shapes of all of KeyMI's tile operands should be saved.
KeyMI can never be a tilestore, because in the volatile model the tile data of a tilestore must come from a tileload.
So a tileload is preferred as the KeyMI. Line 154 in getKeyAMXInstr should never be reached; I'll replace it with an assert.

193

Because we don't know the number of shapes at first, we may encounter the shapes out of order.

198

I think it is impossible at O0.

230

Yes, I missed it, thanks!
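
A minimal sketch of the fix, in the pseudo-IR style of the summary (the name %amx.tmm.palette is made up here and stands for byte 0 of %cfgmem):

----------------------------------------------------------------------
%cfgmem = alloca <16 x i32>, align 4
store <16 x i32> zeroinitializer, <16 x i32>* %cfgmem
store volatile i8 1, i8* %amx.tmm.palette          // palette = 1
call void @llvm.x86.ldtilecfg.internal(i8* %cfgmem)
----------------------------------------------------------------------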

284

Do you mean materializeTileCfg(MI) here? That modifies the MBB, which would invalidate the MBB iterators.

llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp
56–58

I think that doesn't matter for a static opt.

llvm/lib/Target/X86/X86LowerAMXType.cpp
411

The IsPHI flag passed by the caller makes sure it is a phi.

415

It can't be nullptr; currently all tile defs should come from an IntrinsicInst. (PHIs are handled specially.)

533

I've started thinking about this case; I think it should never happen. Have we met a recursive PHI before? I think a "recursive PHI" would be a PHI with more than 2 operands.

639

Greedy also needs this AMX lower type pass.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
206

What we care about is that there should be only one tileload for the tilestore.

llvm/test/CodeGen/X86/AMX/amx-configO0toO0.ll
3

You mean --check-prefix=AMX_O0? I just thought it was clearer for this test.

xiangzhangllvm added inline comments.Apr 15 2021, 5:44 PM
llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Make sense, thanks!

xiangzhangllvm marked 12 inline comments as done.Apr 16 2021, 1:11 AM

The patch is big; thanks to Pengfei for the review!

llvm/lib/Target/X86/X86PreAMXConfig.cpp
217

// here "be stored" means "be stored into mem" not "be TileStore instruction", I'll refine this comments.

357

It is checked in its caller.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Updated in the other tests, e.g. llvm/test/CodeGen/X86/AMX/amx-configO2toO0-precfg.ll, line 30.

Refine Shape info:
In the AMX intrinsics we let Shape = {Row, Col (bytes)}, but RealCol = Col / ElementSize. We may use the RealCol as a new Row for other newly created AMX intrinsics.
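
A worked example of that conversion (hypothetical values; %col and %real.col are made-up names, and the pass emits the division as an IR udiv):

----------------------------------------------------------------------
// Shape = {Row, Col (bytes)}; for i32 elements, ElementSize = 4:
//   RealCol = Col / ElementSize, e.g. Col = 64 bytes -> RealCol = 16
%real.col = udiv i16 %col, 4       // may be used as a new Row for a created intrinsic
----------------------------------------------------------------------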

LuoYuanke added inline comments.Apr 19 2021, 8:13 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
96

Do we need to calculate the value at compile time if V is constant?

llvm/lib/Target/X86/X86PreAMXConfig.cpp
69

I don't understand the function. Does it mean only the tilezero intrinsic returns true?

xiangzhangllvm added inline comments.Apr 19 2021, 8:18 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
96

CreateUDiv already handles it.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
69

And currently tileload too; this function returns whether the intrinsic only defines a tile (and does not use one).

pengfei added inline comments.Apr 19 2021, 9:04 PM
llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
1042

Is this function actually used?

llvm/lib/Target/X86/X86LowerAMXType.cpp
68

Better to move it into class X86LowerAMXType.

370–372

You can make it a member of X86VolatileTileData. Then you don't need to calculate F here.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
205–207

return preWriteTileCfg(I8Ptr, Cfg, Shapes); ?

324

Does it cause a warning in release builds?

I'll try building it in release mode and make sure there are no warnings, thanks!

llvm/include/llvm/CodeGen/CodeGenPassBuilder.h
1042

I think I mixed it up with the identically named function "bool addPostFastRegAllocRewrite()", thank you!

llvm/lib/Target/X86/X86LowerAMXType.cpp
68

I considered that before; here I just didn't want to pass it into the static function (getRowFromCol).

370–372

This is in a static function, not a member function of X86VolatileTileData.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
205–207

Let's return Init0 to keep the assert(Init0 && ...), thanks!

324

I think yes, good catch, thanks!

xiangzhangllvm marked 2 inline comments as done.Apr 20 2021, 12:44 AM

Refined: no new warnings in the release build.

llvm/lib/Target/X86/X86FastTileConfig.cpp
298

Should this return true based on whether the tilecfg was rewritten?

llvm/lib/Target/X86/X86LowerAMXType.cpp
413

Coding style: Use {} when if uses it.

635–639

Remove unused code.

llvm/lib/Target/X86/X86PreAMXConfig.cpp
268

Maybe better to use BasicBlock::iterator &, then you don't need to return it.

282

Coding style: Use {} when the else uses it.

372–374

Coding style: Don't use {} for single line.

pengfei added inline comments.Apr 20 2021, 1:42 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
131

Can we always get the shape from the def of an AMX value? Then we can avoid adding the calculation for it.
We can add an assert here to make sure of it.

xiangzhangllvm marked an inline comment as done.Apr 20 2021, 1:49 AM
xiangzhangllvm added inline comments.
llvm/lib/Target/X86/X86LowerAMXType.cpp
131

We can only pass the element size into the AMX intrinsic def. The calculation is still needed.

xiangzhangllvm marked 5 inline comments as done.Apr 20 2021, 7:09 PM
xiangzhangllvm added inline comments.
llvm/lib/Target/X86/X86PreAMXConfig.cpp
268

"Let I = getShapesAndConfigPosEnd(...)" in the caller is more readable, in my eyes.

pengfei accepted this revision.Apr 20 2021, 7:48 PM

The implementation looks good to me in general. Let's wait a few days for opinions from the community and other reviewers.

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Nit: the source lacks palette initialization. It's up to you whether to add it.

This revision is now accepted and ready to land.Apr 20 2021, 7:48 PM

Thanks for your review!

llvm/test/CodeGen/X86/AMX/amx-fast-tile-config.mir
32

Yes, it doesn't matter here; this test focuses on checking the rewriting of shapes after fast register allocation.
So I didn't write the palette in the test.
In the other tests (amx-configO2toO0.ll, amx-configO0toO0.ll) we can see the palette set to 1.

Refined with clang-format.
Also added a FIXME at LowerAMXType.

xiangzhangllvm added inline comments.Apr 24 2021, 6:41 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
130

Added a FIXME at LowerAMXType.

This revision was landed with ongoing or failed builds.Apr 24 2021, 6:48 PM
This revision was automatically updated to reflect the committed changes.
bkramer added inline comments.
llvm/lib/Target/X86/X86LowerAMXType.cpp
68

You can't simply have global state here; it doesn't work in a multithreaded environment. I reverted this change in df323ba445f7fc4d29def8950e80dec6ba487961 because it breaks us.

Fix bkramer's multithreading problem.

xiangzhangllvm marked an inline comment as done.May 7 2021, 1:07 AM

That's better, thanks!

Refine: compatible with the old AMX API.