This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX][fastalloc] Allocate tile register based on its shape.
Needs Review · Public

Authored by LuoYuanke on May 14 2022, 2:49 AM.

Details

Summary

Allocating tile registers in a separate pass has 3 benefits.

  1. When fast register allocation spills a tile register, it creates a
     virtual register for the stride. The instruction that defines this
     virtual register is inserted after the current instruction (the one
     being allocated physical registers), so fast register allocation
     never visits the new instruction and the virtual register is never
     assigned a physical register.

  2. When filling in the shape information of the tile configuration, we
     need to know the physical tile register. If the shape (row, column)
     has already been allocated a physical register, that physical
     register may be re-defined before it is written to the stack. The
     re-definition may happen while splitting or spilling registers. If
     the shape is in a virtual register, there is no such problem.

%al = 1
mov %al, %bl
%al = 2                  ; %al is re-defined before the store
store stack.cfg, %al     ; config now sees the re-defined value
ldtilecfg

  3. A single configuration may fail. In that case we can fall back to
     multi-config and allocate the tile registers separately, while
     leaving the other registers to be allocated by greedy RA.
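
As a rough illustration of the second point, the sketch below shows why the physical tile register must be known before the configuration can be filled in: the shape is written at an offset derived from the physical tile index. This is not code from this patch. `TileConfig`, `Shape`, `setTileShape` and `getTileShape` are hypothetical names, and the byte offsets assume the 64-byte palette 1 layout of the `ldtilecfg` memory operand (palette at byte 0, `colsb` at bytes 16 + 2*i, `rows` at bytes 48 + i).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper types, not taken from the patch.
// Assumed layout of the 64-byte ldtilecfg operand (palette 1):
//   byte 0        palette id
//   byte 1        start_row
//   bytes 16..31  colsb for tmm0..tmm7 (16 bits each, little endian)
//   bytes 48..55  rows for tmm0..tmm7 (8 bits each)
using TileConfig = std::array<std::uint8_t, 64>;

struct Shape {
  std::uint8_t Rows;   // number of rows, at most 16
  std::uint16_t ColsB; // bytes per row, at most 64
};

inline TileConfig makeEmptyConfig() {
  TileConfig Cfg{}; // all bytes zero
  Cfg[0] = 1;       // select palette 1
  return Cfg;
}

// Record the shape of physical tile PhysTile (0..7) in the config buffer.
// Returns false if the tile index or the shape is out of range.
inline bool setTileShape(TileConfig &Cfg, unsigned PhysTile, Shape S) {
  if (PhysTile >= 8 || S.Rows > 16 || S.ColsB > 64)
    return false;
  Cfg[16 + 2 * PhysTile] = static_cast<std::uint8_t>(S.ColsB & 0xff);
  Cfg[16 + 2 * PhysTile + 1] = static_cast<std::uint8_t>(S.ColsB >> 8);
  Cfg[48 + PhysTile] = S.Rows;
  return true;
}

// Read back the shape recorded for physical tile PhysTile.
inline Shape getTileShape(const TileConfig &Cfg, unsigned PhysTile) {
  assert(PhysTile < 8 && "palette 1 has eight tile registers");
  std::uint16_t ColsB = static_cast<std::uint16_t>(
      Cfg[16 + 2 * PhysTile] | (Cfg[16 + 2 * PhysTile + 1] << 8));
  return Shape{Cfg[48 + PhysTile], ColsB};
}
```

In this model the row and column values are only consumed once the tile index is fixed, which is why keeping them in virtual registers until then avoids the re-definition hazard shown above.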

Event Timeline

LuoYuanke created this revision. May 14 2022, 2:49 AM
Herald added a project: Restricted Project. May 14 2022, 2:49 AM
LuoYuanke requested review of this revision. May 14 2022, 2:49 AM
LuoYuanke updated this revision to Diff 429587. May 15 2022, 6:39 PM

Fix lit test failure.

LuoYuanke updated this revision to Diff 433044. May 31 2022, 3:03 AM

Distinguish shape when allocating 2D tile register.

LuoYuanke updated this revision to Diff 433062. May 31 2022, 5:58 AM

Add test case for reusing a physical tile register.

LuoYuanke updated this revision to Diff 433074. May 31 2022, 6:53 AM

Add test case across config.

There's a lot going on here.

  • Could you extract the ShouldAllocClass fixes for fastregalloc into a separate diff so we can discuss them separately and get feedback from the AMDGPU folks, who introduced this and are, AFAIK, the other major user?
  • I am still wrapping my head around tile registers and tile register configs; specifically, I wonder whether the support for them really needs to be integrated into the generic register allocation code or whether there is a way to materialize the register configurations in a post-pass. For example, do you know how the legacy x86 x87-FPU support works? There we use the register allocator to allocate pseudo FP registers fp0-fp7 and then use a post-pass in X86FloatingPoint to insert the necessary stack management operations after the fact. I am not saying it's the same problem, but it is an example where we managed to have regalloc allocate to intermediate pseudo registers and handle the complications (in that case register stacks) in a target-specific pass, so we didn't need to introduce the concept of register stacks to the generic code.

There's a lot going on here.

  • Could you extract the ShouldAllocClass fixes for fastregalloc into a separate diff so we can discuss them separately and get feedback from the AMDGPU folks, who introduced this and are, AFAIK, the other major user?

Thank you for the review. Sure, I'll extract the ShouldAllocClass fixes for fastregalloc into a separate patch.

  • I am still wrapping my head around tile registers and tile register configs; specifically, I wonder whether the support for them really needs to be integrated into the generic register allocation code or whether there is a way to materialize the register configurations in a post-pass. For example, do you know how the legacy x86 x87-FPU support works? There we use the register allocator to allocate pseudo FP registers fp0-fp7 and then use a post-pass in X86FloatingPoint to insert the necessary stack management operations after the fact. I am not saying it's the same problem, but it is an example where we managed to have regalloc allocate to intermediate pseudo registers and handle the complications (in that case register stacks) in a target-specific pass, so we didn't need to introduce the concept of register stacks to the generic code.

We have a pre-pass that inserts the config instructions and a post-pass that fills in the shape information of each physical tile register and feeds it to the config instruction.
I think tile register allocation and x87-FPU register allocation are different problems. The problem with x87-FPU registers is that the hardware arranges them as a stack, and I'm not sure the stackify pass can convert pseudo instructions to x87 instructions efficiently without inserting extra instructions to adjust the stack order. The problem with tile registers comes from configuration. Before any tile register is accessed, it must be configured with its shape (row, column) information, and the hardware executes AMX instructions based on that shape. The config instruction also clears the data in all tile registers. If we added a post-fixup pass like the stackify pass, we would need to reconfigure whenever values with different shapes are allocated to the same physical tile register, and each reconfiguration would clobber all tile registers. That may generate too many config instructions and spills/reloads in the post-fixup pass. Nevertheless, I am happy to work with the community on any suggestion to improve the solution for AMX register allocation.
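
To make the shape constraint concrete, here is a minimal sketch of shape-aware assignment within one configuration region. It is a toy model under my own assumptions, not the allocator in this patch: `TileShape`, `TileSlot` and `ToyTileAllocator` are hypothetical names, and the model assumes eight physical tiles, each bound to a single shape until the next config instruction.

```cpp
#include <array>
#include <cassert>
#include <optional>

// Hypothetical types for illustration only.
struct TileShape {
  unsigned Rows = 0;
  unsigned ColsB = 0;
  bool operator==(const TileShape &O) const {
    return Rows == O.Rows && ColsB == O.ColsB;
  }
};

struct TileSlot {
  std::optional<TileShape> Bound; // shape fixed by the current config
  bool Live = false;              // currently holds a value
};

// Toy model of one config region with eight physical tiles.
class ToyTileAllocator {
  std::array<TileSlot, 8> Slots{};

public:
  // Returns a physical tile for a value of shape S, or nullopt when every
  // tile is either live or already bound to a different shape. In the
  // nullopt case a real allocator would have to spill or start a new
  // config region.
  std::optional<unsigned> allocate(TileShape S) {
    // Prefer a free tile that is already bound to the same shape.
    for (unsigned I = 0; I < Slots.size(); ++I) {
      TileSlot &T = Slots[I];
      if (!T.Live && T.Bound && *T.Bound == S) {
        T.Live = true;
        return I;
      }
    }
    // Otherwise bind a tile that no shape has claimed yet.
    for (unsigned I = 0; I < Slots.size(); ++I) {
      TileSlot &T = Slots[I];
      if (!T.Bound) {
        T.Bound = S;
        T.Live = true;
        return I;
      }
    }
    return std::nullopt;
  }

  // Mark a tile as free; its shape binding stays until reconfiguration.
  void release(unsigned PhysTile) {
    assert(PhysTile < Slots.size() && "only eight physical tiles");
    Slots[PhysTile].Live = false;
  }
};
```

A freed tile is reused only by a value of the same shape, so no reconfiguration (and no clobbering of the other live tiles) is needed inside the region. A shape-agnostic allocator followed by a fixup pass would instead have to insert a config instruction each time a tile changes shape.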

LuoYuanke retitled this revision from "[X86][AMX][fastalloc] Allocate tile register separately." to "[X86][AMX][fastalloc] Allocate tile register based on its shape." Jun 12 2022, 6:58 PM