This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: buffer.load.format intrinsic and si-load-shrink pass
AbandonedPublic

Authored by nhaehnle on Oct 20 2015, 6:32 AM.

Details

Summary

This is work in progress, please provide feedback. It supersedes http://reviews.llvm.org/D13586 based on comments there.

At a high-level, the motivation of these changes is:

  1. Add an llvm.amdgcn.buffer.load.format intrinsic to expose (almost) the full range of what the hardware can do (minus addr64 mode and D16 variants, both of which should arguably get their own intrinsics).
  2. For both image loads/samples and buffer loads, split the (simple) optimization of determining the appropriate size/dmask of the load into an IR-level CodeGenPrepare pass, while the selection of the appropriate machine instruction stays in SIISelLowering and in TableGen patterns, respectively.
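To illustrate the intended division of labor, here is a sketch in LLVM IR (the intrinsic names and operand lists below follow this proposal's naming scheme but are illustrative, not a final interface):

```llvm
; Sketch only: a vec4 format load of which only the first component is used.
%data = call <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(
            <4 x i32> %rsrc, i32 %vindex, i32 %voffset, i1 false, i1 false)
%x = extractelement <4 x float> %data, i32 0

; The IR-level shrinking pass would rewrite this to the scalar variant, and
; instruction selection would then pick BUFFER_LOAD_FORMAT_X via patterns:
%x2 = call float @llvm.amdgcn.buffer.load.format.f32(
            <4 x i32> %rsrc, i32 %vindex, i32 %voffset, i1 false, i1 false)
```

The point of the split is that the IR pass only reasons about which components are used, while the per-instruction knowledge stays in the selection patterns.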

Known issues / questions:

  1. There are some regressions in image/sample-related tests. Also, the buffer.load.format tests obviously need to be expanded, as noted.
  2. It is annoying to match the MIMG-related intrinsics by name at the IR level, but this seems to be necessary as long as those intrinsics are defined in the target .td files and therefore have no IntrinsicID assigned.
  3. The big one: the v3f32 variant (BUFFER_LOAD_FORMAT_XYZ) is currently not supported, because v3f32 is not a MachineValueType and the type legalization step of codegen bails out. IMO, the clean solution would be to argue that, since amdgcn is a real existing target that genuinely has 3-element vector instructions for non-crazy reasons, v3f32 (and v3i32) should be added to MachineValueType. But of course this is a pretty core change, and the image path in SIISelLowering did not go that route, hacking around it instead.

I have not actually looked into an alternative design without modifying MVT. Since, at the IR level, the desired size must be represented by the return type of the intrinsic, any hack-around would somehow involve convincing the type legalization step to accept v3f32 (unlike in the image case, where the IR level always uses v4f32 and implicitly stores the size in the dmask).

Perhaps there is a way of telling the target-independent codegen to accept v3f32 without actually adding it to MVT?

So... going the route of adding v3f32 and v3i32 to MVT is the best path that I can see, but before I do that, I want to get feedback on the plan.
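Concretely, the missing case is the three-component load; in IR it would look roughly like this (hypothetical declaration, following the naming scheme above):

```llvm
; BUFFER_LOAD_FORMAT_XYZ would naturally return three components:
declare <3 x float> @llvm.amdgcn.buffer.load.format.v3f32(
            <4 x i32>, i32, i32, i1, i1)
; <3 x float> is a perfectly legal IR type, but there is no MVT::v3f32,
; so SelectionDAG type legalization tries to widen/split the call result
; and gives up before any pattern can match.
```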

  4. Eventually, Mesa and other clients are intended to emit the buffer.load.format intrinsic directly. In the meantime, SI.load.input should be transformed early on. Where is the best place to do that from a design POV? The SILoadShrink pass is an obvious candidate, though doing so clashes with the name given to that pass.

Diff Detail

Event Timeline

nhaehnle updated this revision to Diff 37871. Oct 20 2015, 6:32 AM
nhaehnle retitled this revision from to AMDGPU: buffer.load.format intrinsic and si-load-shrink pass.
nhaehnle updated this object.
nhaehnle added reviewers: tstellarAMD, mareko.
mareko edited edge metadata. Oct 20 2015, 8:06 AM

It can be difficult to select the non-vec4 loads in TGSI. There is not enough readily-available information for that there.

ADDR64 loads shouldn't get their own intrinsics, because:

  • They are already matched by the generic loads
  • VI and later don't support them (FLAT opcodes should be used instead, but I think those need more work outside of LLVM)

Also, Mesa doesn't use MUBUF ADDR64.

I wouldn't worry about D16. We don't need it.

include/llvm/IR/IntrinsicsAMDGPU.td:114

A small nit: we define such intrinsics in SIIntrinsics.td.

include/llvm/IR/IntrinsicsAMDGPU.td:114

IntrinsicsAMDGPU.td is the correct place for new intrinsics. At some point we should move all the intrinsics there.
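For reference, such a definition might look roughly like this in TableGen (operand list and properties are an assumption for illustration; the actual diff may differ):

```tablegen
// Hypothetical sketch of the declaration, not the reviewed diff.
def int_amdgcn_buffer_load_format : Intrinsic<
  [llvm_anyfloat_ty],   // overloaded result: f32, v2f32 or v4f32
  [llvm_v4i32_ty,       // rsrc
   llvm_i32_ty,         // vindex
   llvm_i32_ty,         // voffset
   llvm_i1_ty,          // glc
   llvm_i1_ty],         // slc
  [IntrReadArgMem]>;
```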

> It can be difficult to select the non-vec4 loads in TGSI. There is not enough readily-available information for that there.

True.

The underlying reason for supporting the full range of intrinsics is to be able to implement this optimization as an IR pass as Matt suggested. The fact that it will then theoretically be possible to choose non-vec4 loads in Mesa directly is only a nice side-effect.

> ADDR64 loads shouldn't get their own intrinsics, because:
>
>   • They are already matched by the generic loads
>   • VI and later don't support them (FLAT opcodes should be used instead, but I think those need more work outside of LLVM)
>
> Also, Mesa doesn't use MUBUF ADDR64.

Ok.

> I wouldn't worry about D16. We don't need it.

Ok.

Any thoughts about the v3f32 issue?

arsenm added inline comments. Oct 22 2015, 11:10 AM
include/llvm/IR/IntrinsicsAMDGPU.td
114

Only the supported ones. Backend-only ones, like the control flow intrinsics, should stay private to the backend.

> Any thoughts about the v3f32 issue?

This is a pretty big problem everywhere. You can use a 3 vector intrinsic type, but it will require some special handling to lower the intrinsic with the artificially illegal type.

>> Any thoughts about the v3f32 issue?
>
> This is a pretty big problem everywhere. You can use a 3 vector intrinsic type, but it will require some special handling to lower the intrinsic with the artificially illegal type.

By special handling I mean you probably need to directly select the instruction during lowering, before the type legalizer tries to do anything with it.

nhaehnle abandoned this revision.Feb 21 2018, 6:55 AM