This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] Split greedy RA for tile register
ClosedPublic

Authored by LuoYuanke on Jun 25 2022, 7:11 AM.

Details

Summary

When we fill the shape to tile configure memory, the shape is gotten
from AMX pseudo instruction. However the register for the shape may be
split or spilled by greedy RA. That cause we fill the shape to config
memory after ldtilecfg is executed, so that the shape configuration
would be wrong.
This patch is to split the tile register allocation from greedy register
allocation, so that after tile registers are allocated the shape
registers are still virtual register. The shape register only may be
redefined or multi-defined by phi elimination pass, two address pass.
That doesn't affect tile register configuration.

Diff Detail

Event Timeline

LuoYuanke created this revision.Jun 25 2022, 7:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2022, 7:11 AM
LuoYuanke requested review of this revision.Jun 25 2022, 7:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2022, 7:11 AM

I think the spill/split should still cover the shape regs:

Let me first simply remember our previous action about greedy allocation for AMX:

1 We collected shapes (MOs) in allocating tile regs (by hint tile for same shape) in greedy
2 After greedy we insert the "fill" instructions to set the shape to ldtilecfg's mem. (They are still virtual)
3 Then the rewriter assign to physic regs to them.

The order of related passes:

Greedy Register Allocator
Verify generated machine code
Tile Register Configure
Verify generated machine code
Virtual Register Rewriter
Verify generated machine code
Register Allocation Pass Scoring

Example: After Tile Register Configure

96B       VMOVUPSZmr %stack.0, 1, $noreg, 0, $noreg, %13:vr512 :: (store (s512) into %stack.0, align 4)
104B      MOV8mi %stack.0, 1, $noreg, 0, $noreg, 1 :: (store (s512) into %stack.0, align 4)
112B      MOV16mi %stack.0, 1, $noreg, 18, $noreg, 8 :: (store (s512) into %stack.0 + 18, align 2, basealign 4)
116B      MOV8mi %stack.0, 1, $noreg, 50, $noreg, 8 :: (store (s512) into %stack.0 + 50, align 2, basealign 4)
124B      MOV16mr %stack.0, 1, $noreg, 20, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 20, align 4)
132B      MOV8mr %stack.0, 1, $noreg, 49, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 49, align 1, basealign 4)
140B      MOV16mr %stack.0, 1, $noreg, 16, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 16, align 4)
148B      MOV8mr %stack.0, 1, $noreg, 48, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 48, align 4)
172B      PLDTILECFGV %stack.0, 1, $noreg, 0, $noreg, implicit-def dead $tmm0,  xxx
xiangzhangllvm added inline comments.Jun 27 2022, 7:31 PM
llvm/lib/Target/X86/X86TargetMachine.cpp
637

Will the "TargetPassConfig::addRegAssignAndRewriteOptimized()" handle tile register again ?

I think the spill/split should still cover the shape regs:

Let me first simply remember our previous action about greedy allocation for AMX:

1 We collected shapes (MOs) in allocating tile regs (by hint tile for same shape) in greedy
2 After greedy we insert the "fill" instructions to set the shape to ldtilecfg's mem. (They are still virtual)
3 Then the rewriter assign to physic regs to them.

The order of related passes:

Greedy Register Allocator
Verify generated machine code
Tile Register Configure
Verify generated machine code
Virtual Register Rewriter
Verify generated machine code
Register Allocation Pass Scoring

Example: After Tile Register Configure

96B       VMOVUPSZmr %stack.0, 1, $noreg, 0, $noreg, %13:vr512 :: (store (s512) into %stack.0, align 4)
104B      MOV8mi %stack.0, 1, $noreg, 0, $noreg, 1 :: (store (s512) into %stack.0, align 4)
112B      MOV16mi %stack.0, 1, $noreg, 18, $noreg, 8 :: (store (s512) into %stack.0 + 18, align 2, basealign 4)
116B      MOV8mi %stack.0, 1, $noreg, 50, $noreg, 8 :: (store (s512) into %stack.0 + 50, align 2, basealign 4)
124B      MOV16mr %stack.0, 1, $noreg, 20, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 20, align 4)
132B      MOV8mr %stack.0, 1, $noreg, 49, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 49, align 1, basealign 4)
140B      MOV16mr %stack.0, 1, $noreg, 16, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 16, align 4)
148B      MOV8mr %stack.0, 1, $noreg, 48, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 48, align 4)
172B      PLDTILECFGV %stack.0, 1, $noreg, 0, $noreg, implicit-def dead $tmm0,  xxx

In previous code, the tile register and shape register are allocated in the same pass. The shape config is processed before "virtregrewriter". Though when filling shape, it is still virtual register, but the virtual register has been split or spillied.

LuoYuanke added inline comments.Jun 27 2022, 8:02 PM
llvm/lib/Target/X86/X86TargetMachine.cpp
637

No. In the 2nd greedy pass, the tile register has been assign to a physical register, so it won't allocate for tile register again.

In previous code, the tile register and shape register are allocated in the same pass. The shape config is processed before "virtregrewriter". Though when filling shape, it is still virtual register, but the virtual register has been split or spillied.

OK, I got it, in the Tile Register Configure the virtual reg is "bind" with physic reg now.

It make sense to me now, I'll accept it if no other ops.

xiangzhangllvm accepted this revision.Jun 28 2022, 5:39 PM
This revision is now accepted and ready to land.Jun 28 2022, 5:39 PM
This revision was landed with ongoing or failed builds.Jun 28 2022, 7:36 PM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.Jun 29 2022, 1:31 AM

It looks like this change had some compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=fc2d96c334a15d00965eb57a99d49e46728641db&to=5cb09798700aecff1f9f61b7cd80852c61e10fa8&stat=instructions I wonder whether there is any easy way to avoid the overhead if tile registers are not used?

LuoYuanke added a comment.EditedJun 29 2022, 2:00 AM

It looks like this change had some compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=fc2d96c334a15d00965eb57a99d49e46728641db&to=5cb09798700aecff1f9f61b7cd80852c61e10fa8&stat=instructions I wonder whether there is any easy way to avoid the overhead if tile registers are not used?

We check the ShouldAllocateClass() in RegAllocBase::enqueue(). The overhead looks small because in the first GreedyRA pass most vritual register is not enqueued yet, but it seems I was wrong. I notice the regession is in O3 build, let me check if any more passes are added unexpectedly in O3

It looks like this change had some compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=fc2d96c334a15d00965eb57a99d49e46728641db&to=5cb09798700aecff1f9f61b7cd80852c61e10fa8&stat=instructions I wonder whether there is any easy way to avoid the overhead if tile registers are not used?

We check the ShouldAllocateClass() in RegAllocBase::enqueue(). The overhead looks small because in the first GreedyRA pass most vritual register is not enqueued yet, but it seems I was wrong. I notice the regession is in O3 build, let me check if any more passes are added unexpectedly in O3

Let me check ShouldAllocateClass() earlier to see if it can fix the regression.

@nikic , could you help to check if D128804 can fix the regression?