When we fill the shape to tile configure memory, the shape is gotten
from AMX pseudo instruction. However the register for the shape may be
split or spilled by greedy RA. That cause we fill the shape to config
memory after ldtilecfg is executed, so that the shape configuration
would be wrong.
This patch is to split the tile register allocation from greedy register
allocation, so that after tile registers are allocated the shape
registers are still virtual register. The shape register only may be
redefined or multi-defined by phi elimination pass, two address pass.
That doesn't affect tile register configuration.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
I think the spill/split should still cover the shape regs:
Let me first simply remember our previous action about greedy allocation for AMX:
1 We collected shapes (MOs) in allocating tile regs (by hint tile for same shape) in greedy
2 After greedy we insert the "fill" instructions to set the shape to ldtilecfg's mem. (They are still virtual)
3 Then the rewriter assign to physic regs to them.
The order of related passes:
Greedy Register Allocator Verify generated machine code Tile Register Configure Verify generated machine code Virtual Register Rewriter Verify generated machine code Register Allocation Pass Scoring
Example: After Tile Register Configure
96B VMOVUPSZmr %stack.0, 1, $noreg, 0, $noreg, %13:vr512 :: (store (s512) into %stack.0, align 4) 104B MOV8mi %stack.0, 1, $noreg, 0, $noreg, 1 :: (store (s512) into %stack.0, align 4) 112B MOV16mi %stack.0, 1, $noreg, 18, $noreg, 8 :: (store (s512) into %stack.0 + 18, align 2, basealign 4) 116B MOV8mi %stack.0, 1, $noreg, 50, $noreg, 8 :: (store (s512) into %stack.0 + 50, align 2, basealign 4) 124B MOV16mr %stack.0, 1, $noreg, 20, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 20, align 4) 132B MOV8mr %stack.0, 1, $noreg, 49, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 49, align 1, basealign 4) 140B MOV16mr %stack.0, 1, $noreg, 16, $noreg, %1.sub_16bit:gr32 :: (store (s512) into %stack.0 + 16, align 4) 148B MOV8mr %stack.0, 1, $noreg, 48, $noreg, %0.sub_8bit:gr32 :: (store (s512) into %stack.0 + 48, align 4) 172B PLDTILECFGV %stack.0, 1, $noreg, 0, $noreg, implicit-def dead $tmm0, xxx
llvm/lib/Target/X86/X86TargetMachine.cpp | ||
---|---|---|
637 | Will the "TargetPassConfig::addRegAssignAndRewriteOptimized()" handle tile register again ? |
In previous code, the tile register and shape register are allocated in the same pass. The shape config is processed before "virtregrewriter". Though when filling shape, it is still virtual register, but the virtual register has been split or spillied.
llvm/lib/Target/X86/X86TargetMachine.cpp | ||
---|---|---|
637 | No. In the 2nd greedy pass, the tile register has been assign to a physical register, so it won't allocate for tile register again. |
OK, I got it, in the Tile Register Configure the virtual reg is "bind" with physic reg now.
It make sense to me now, I'll accept it if no other ops.
It looks like this change had some compile-time impact: http://llvm-compile-time-tracker.com/compare.php?from=fc2d96c334a15d00965eb57a99d49e46728641db&to=5cb09798700aecff1f9f61b7cd80852c61e10fa8&stat=instructions I wonder whether there is any easy way to avoid the overhead if tile registers are not used?
We check the ShouldAllocateClass() in RegAllocBase::enqueue(). The overhead looks small because in the first GreedyRA pass most vritual register is not enqueued yet, but it seems I was wrong. I notice the regession is in O3 build, let me check if any more passes are added unexpectedly in O3
Will the "TargetPassConfig::addRegAssignAndRewriteOptimized()" handle tile register again ?