
[AMDGPU] Extend the SI Load/Store optimizer to combine more things.
ClosedPublic

Authored by sheredom on Nov 2 2018, 11:27 AM.

Details

Summary

I've extended the load/store optimizer to be able to produce dwordx3 loads and stores, and also enable it to produce dwordx8 and dwordx16 sgpr loads. This change allows many more load/stores to be combined, and results in much more optimal code for our hardware.
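To illustrate the core idea (not code from the patch itself): the optimizer looks for loads/stores from the same base whose constant offsets are adjacent and fuses them into one wider operation. A rough standalone sketch, with hypothetical names and dword-granularity offsets:

```cpp
// Hypothetical sketch of the merge decision, not the actual
// SILoadStoreOptimizer code. Offsets and widths are in dwords.
#include <cassert>
#include <optional>

struct MemOp {
  int BaseReg; // register holding the base address
  int Offset;  // offset from the base, in dwords
  int Width;   // number of dwords accessed (1, 2, 3, or 4)
};

// Returns the merged operation if the two accesses are contiguous,
// std::nullopt otherwise. Results are capped at dwordx4, mirroring the
// reduced scope of this change (producing x3, and combining x3 into x4).
std::optional<MemOp> tryMerge(const MemOp &A, const MemOp &B) {
  if (A.BaseReg != B.BaseReg)
    return std::nullopt;
  const MemOp &Lo = A.Offset <= B.Offset ? A : B;
  const MemOp &Hi = A.Offset <= B.Offset ? B : A;
  if (Lo.Offset + Lo.Width != Hi.Offset) // must be adjacent, no gap
    return std::nullopt;
  const int NewWidth = Lo.Width + Hi.Width;
  if (NewWidth > 4) // cap at dwordx4
    return std::nullopt;
  return MemOp{Lo.BaseReg, Lo.Offset, NewWidth};
}
```

For example, a dwordx2 load at offset 0 and a dword load at offset 2 from the same base merge into a single dwordx3 load at offset 0; the real pass additionally has to check instruction encodings, memory dependencies, and register pressure.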

Diff Detail

Repository
rL LLVM

Event Timeline

sheredom created this revision. Nov 2 2018, 11:27 AM

I think these cases should mostly be handled by an IR pass that merges load/store intrinsics, and we need to fix the handling of 3-element vectors in SelectionDAG.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
112 ↗(On Diff #172406)

No global static initializer

mareko added a subscriber: mareko. Nov 2 2018, 1:39 PM

I'm concerned that x8 and x16 loads will significantly increase SGPR usage and therefore SGPR spilling. We have a shader database with over 70 games and benchmarks, and I suspect the results will regress once this is committed.

There is another case that can be optimized: loading {f32, f32, skip, f32} and {f32, skip, f32, f32}. Both can be done with x4 loads, for both scalar and vector instructions, at the cost of one extra VGPR or SGPR. Note that register allocation may reuse the unused register immediately, which would force an unnecessary s_waitcnt after the load and may hurt us.
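The hole-tolerant case Marek describes can be sketched as a coverage check: given the dword offsets a shader actually needs, decide whether one dwordx4 load covers them all and how many loaded dwords would be wasted. This is an illustrative sketch with hypothetical names, not code from the pass:

```cpp
// Hedged sketch: can a single dwordx4 load starting at the base cover
// all required dwords, tolerating skipped lanes? The cost of a skipped
// lane is one wasted VGPR/SGPR holding a value nobody reads.
#include <cassert>
#include <vector>

// Offsets are the needed dword positions relative to the base (sorted,
// unique). On success, WastedDwords reports how many of the four loaded
// dwords go unused.
bool coverableByX4(const std::vector<int> &Offsets, int &WastedDwords) {
  for (int O : Offsets)
    if (O < 0 || O > 3)
      return false; // outside the 4-dword window
  WastedDwords = 4 - static_cast<int>(Offsets.size());
  return true;
}
```

So both {f32, f32, skip, f32} (offsets {0, 1, 3}) and {f32, skip, f32, f32} (offsets {0, 2, 3}) are coverable with one wasted dword; a real heuristic would weigh that waste against register pressure, as the comment above cautions.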

arsenm added a comment. Nov 2 2018, 1:41 PM

My original intent for this pass was to handle non-adjacent DS writes, with the goal of someday producing x8 and x16 loads once better register pressure information is available at this later point.

sheredom updated this revision to Diff 172729. Nov 6 2018, 2:39 AM

We discussed this at an internal AMD meeting on Monday, 5 November 2018, and concluded that although I do want the scalar load combining brought upstream, it would be better as a separate change so that it can get broader testing across the users of the AMDGPU backend.

This change has been reduced to only produce dwordx3 loads/stores, and to combine x3 operations into x4.

The huge switch statements are a poster child for the generic SearchableTables, somewhat analogous to what already exists for MIMGInstructions. Sketching it out:

class LoadStoreBaseOpcode {
  LoadStoreBaseOpcode BaseOpcode = !cast<LoadStoreBaseOpcode>(NAME);
  bit Srsrc;
  bit Sbase;
  ...
}

def LoadStoreBaseOpcode : GenericEnum {
  let FilterClass = "LoadStoreBaseOpcode";
}

class LoadStoreOpcode {
  Instruction Opcode;
  LoadStoreBaseOpcode BaseOpcode;
  bits<8> Width;
}

def LoadStoreOpcodeTable : GenericTable {
  let FilterClass = "LoadStoreOpcode";
  let CppTypeName = "LoadStoreOpcode";
  let Fields = ["Opcode", "BaseOpcode", "Width"];
  GenericEnum TypeOf_BaseOpcode = LoadStoreBaseOpcode;

  let PrimaryKey = ["BaseOpcode", "Width"];
  let PrimaryKeyName = "getLoadStoreOpcode";
} 

... and so on ...
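For a sense of what the SearchableTables backend would emit from a sketch like this (all names here are hypothetical; the real output comes from tablegen, and the lookup it generates is a binary search rather than the linear scan used below):

```cpp
// Rough mock of the generated table and its PrimaryKeyName accessor.
#include <cassert>
#include <cstdint>

enum LoadStoreBaseOpcode { BUFFER_LOAD_DWORD_BASE };
enum Opcode { BUFFER_LOAD_DWORD, BUFFER_LOAD_DWORDX2,
              BUFFER_LOAD_DWORDX3, BUFFER_LOAD_DWORDX4 };

struct LoadStoreOpcode {
  Opcode Opc;
  LoadStoreBaseOpcode BaseOpcode;
  uint8_t Width;
};

// One row per concrete instruction, keyed by (BaseOpcode, Width).
static const LoadStoreOpcode LoadStoreOpcodeTable[] = {
  {BUFFER_LOAD_DWORD,   BUFFER_LOAD_DWORD_BASE, 1},
  {BUFFER_LOAD_DWORDX2, BUFFER_LOAD_DWORD_BASE, 2},
  {BUFFER_LOAD_DWORDX3, BUFFER_LOAD_DWORD_BASE, 3},
  {BUFFER_LOAD_DWORDX4, BUFFER_LOAD_DWORD_BASE, 4},
};

// Stand-in for the tablegen-emitted getLoadStoreOpcode lookup.
const LoadStoreOpcode *getLoadStoreOpcode(LoadStoreBaseOpcode Base,
                                          uint8_t Width) {
  for (const auto &Entry : LoadStoreOpcodeTable)
    if (Entry.BaseOpcode == Base && Entry.Width == Width)
      return &Entry;
  return nullptr;
}
```

The pass would then replace its hand-written switch statements with a single `getLoadStoreOpcode(Base, Width)` call, the same way the MIMG code queries its generated tables.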

Not a complete review yet, but I need to sign off for now.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
79 ↗(On Diff #172729)

Why is this needed?

83 ↗(On Diff #172729)

Do those actually occur like this in practice?

sheredom added inline comments. Nov 8 2018, 9:25 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
79 ↗(On Diff #172729)

I use it for cases where the intrinsic does not match any of the intrinsics we can optimize.

83 ↗(On Diff #172729)

Do you mean do I see idxen loads in workloads? Yup!

nhaehnle added inline comments. Nov 8 2018, 10:18 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
83 ↗(On Diff #172729)

Mergeable ones? What pattern in the high-level source does that correspond to?

sheredom updated this revision to Diff 176566. Dec 4 2018, 2:11 AM

Fixed review comments:

  • Made the switch statements auto-generated from BUFInstructions.td via tablegen.
  • Removed the idxen mappings (will try to add them back later in their own commit).
This revision is now accepted and ready to land. Dec 12 2018, 2:12 AM
This revision was automatically updated to reflect the committed changes.