I've extended the load/store optimizer to be able to produce dwordx3 loads and stores, and also enable it to produce dwordx8 and dwordx16 sgpr loads. This change allows many more load/stores to be combined, and results in much more optimal code for our hardware.
Details
Diff Detail
Event Timeline
I think these cases should mostly be handled by an IR pass to merge load/store intrinsics, and we need to fix handling of 3 element vectors in SelectionDAG
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
106 | No global static initializer |
I'm concerned that x8 and x16 loads will significantly increase SGPR usage and therefore SGPR spilling. We have a shader database with over 70 games and benchmarks and I guess the results will not be good after this is committed.
There is another case that can be optimized: Loading {f32, f32, skip, f32} and {f32, skip, f32, f32}. Those can be done with x4 loads for both scalar and vector instructions. The cost is 1 more used VGPR or SGPR. Also, register allocation may reuse the unused register immediately, which will cause unnecessary s_waitcnt after the load and may hurt us.
My original intent with this pass was to handle the non-adjacent DS writes, with a goal of someday merging to x8 and x16 loads when known from better register pressure information at this later point
We discussed this on an internal AMD meeting Monday 5th November 2018, and came to the conclusion that even though I do want the scalar load combining to be brought upstream, it would be better as a separate change so that we can get broader testing across the users of our AMDGPU backend.
This change has been reduced to only allow production of dwordx3, and combining of x3 to turn it into x4.
The huge switch statements are a poster child for the generic SearchableTables, somewhat analogous to what already exists for MIMGInstructions. Sketching it out:
class LoadStoreBaseOpcode { LoadStoreBaseOpcode BaseOpcode = !cast<LoadStoreBaseOpcode>(NAME); bit Srsrc; bit Sbase; ... } def LoadStoreBaseOpcode : GenericEnum { let FilterClass = "LoadStoreBaseOpcode"; } class LoadStoreOpcode { Instruction Opcode; LoadStoreBaseOpcode BaseOpcode; bits<8> Width; } def LoadStoreOpcodeTable : GenericTable { let FilterClass = "LoadStoreOpcode"; let CppTypeName = "LoadStoreOpcode"; let Fields = ["Opcode", "BaseOpcode", "Width"]; GenericEnum TypeOf_BaseOpcode = LoadStoreBaseOpcode; let PrimaryKey = ["BaseOpcode", "Width"]; let PrimaryKeyName = "getLoadStoreOpcode"; } ... and so on ...
Not a complete review yet, but I need to sign off.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
78 | Why is this needed? | |
82 | Do those actually occur like this in practice? |
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
82 | Mergeable ones? What pattern in the high-level source does that correspond to? |
Fixed review comments:
- Made the switch statements autogenerated by BUFInstructions.td tablegen.
- Removed idxen mappings (will try and add them back in later in their own commit).
Why is this needed?