This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add intrinsics for tbuffer load and store
ClosedPublic

Authored by dstuttard on Mar 7 2017, 1:44 AM.

Details

Summary

Intrinsic already existed for llvm.SI.tbuffer.store

Needed tbuffer.load and also re-implementing the intrinsic as llvm.amdgcn.tbuffer.*

Added CodeGen tests for the 2 new variants added.
Left the original llvm.SI.tbuffer.store implementation to avoid issues with existing code

Event Timeline

dstuttard created this revision.Mar 7 2017, 1:44 AM
arsenm added inline comments.Mar 7 2017, 4:14 PM
include/llvm/IR/IntrinsicsAMDGPU.td
481

This should be part of the mangled return type, I don't think there's any reason to have to add this here

487–491

I don't think we can directly expose offen/idxen.

tfe definitely can't be exposed this way because it changes the register class of the output

tstellar added inline comments.Mar 7 2017, 4:19 PM
include/llvm/IR/IntrinsicsAMDGPU.td
487–491

We need some solution to deal with swizzled addressing, because these have really complicated clamping rules. The easiest thing to do is to expose all fields so the user is responsible for making sure the operands are correct.

arsenm added inline comments.Mar 7 2017, 4:25 PM
include/llvm/IR/IntrinsicsAMDGPU.td
487–491

I was wondering if we could use fat pointers, and then have a constant argument hint for what the swizzle factor in the resource is so we can match the addressing mode. Would that work?

tstellar added inline comments.Mar 7 2017, 4:26 PM
include/llvm/IR/IntrinsicsAMDGPU.td
487–491

I think so, you could also use different address spaces to indicate swizzling vs non-swizzling.

See inline comments.

Is this worth looking at for now if there is a different re-factor already in progress?

include/llvm/IR/IntrinsicsAMDGPU.td
481

Isn't the issue that the 3 dword variant can't be generated as we can only have 1,2 and 4 dword return types?

487–491

I think I need to look at this again, in particular the fat pointers. This initial implementation was based on extending the pre-existing llvm.SI.tbuffer.store that already existed.
Matt has also pointed out that there is a re-factor of this in progress already (for store). I guess it makes sense to wait for that as well?

arsenm added inline comments.Mar 8 2017, 12:31 PM
include/llvm/IR/IntrinsicsAMDGPU.td
481

We can have a 3x dword return type in the intrinsic. The only issue is the DAG doesn't treat this as a legal machine type. I have patches to fix this, and we can work around this in codegen. Long term this won't be an issue with global isel.

t-tye added a subscriber: t-tye.Mar 22 2017, 6:40 PM
tony-tye removed a subscriber: tony-tye.Mar 22 2017, 6:47 PM
dstuttard updated this revision to Diff 98287.May 9 2017, 7:57 AM

I've rewritten the implementation of tbuffer load and store and based it around buffer.load.format
and buffer.store.format

This has resulted in a cleaner intrinsic that doesn't include all the instruction fields (no enable
bits, no tfe, single index and offset fields that determine what to do based on what is passed in)

No support yet for the 3 dword variants - but this is a step along the way

This doesn't include fat pointer support - something that can be added later when this is rolled out
more widely

I've also removed the old llvm.SI.tbuffer.store implementation and replaced the uses of it in the lit
testing (this part can be reverted if required).

tstellar added inline comments.May 9 2017, 8:02 AM
include/llvm/IR/IntrinsicsAMDGPU.td
494

Will these be able to support swizzled addressing with only one offset field?

dstuttard added inline comments.May 9 2017, 10:03 AM
include/llvm/IR/IntrinsicsAMDGPU.td
494

This implementation mirrors the buffer.load.format and buffer.store.format intrinsics, so presumably they would suffer from the same problem?

You can generate an instruction with both the index and offset enable bits set (and vgpr offset and index), as well as an offset like this:

%offs.2  = add i32 %offs, 52
%vdata   = call <4 x i32> @llvm.amdgcn.tbuffer.load.v4i32(<4 x i32> %0, i32 %vindex, i32 %offs.2, i32 14, i32 4, i1 0, i1 0)

which results in an instruction like this:

	tbuffer_load_format_xyzw v[0:3], v[0:1], s[0:3],  dfmt:14,  nfmt:4, 0 idxen offen offset:52

does that give you everything required for a swizzle?

arsenm added inline comments.May 9 2017, 7:08 PM
lib/Target/AMDGPU/SIISelLowering.cpp
3466–3467

Replacing the uses in the test is fine, but we need to keep this around until mesa is updated to use the new intrinsic

dstuttard updated this revision to Diff 98467.May 10 2017, 8:44 AM

Reintroduce the legacy llvm.SI.tbuffer.store intrinsic

I've left the tests using the new implementation, but have included the old test for this
intrinsic.

In the end the easiest way to do this was to re-insert the old code and rename with "legacy" tags as
appropriate.
Hopefully at some point this can be removed.

tstellar added inline comments.May 10 2017, 9:24 AM
include/llvm/IR/IntrinsicsAMDGPU.td
494

You should review the "Range Checking" section in the ISA docs for all generations. If you arbitrarily move offset values between soffset, inst_offset and vgpr_offset, you risk failing the range checks, which would turn the instruction into a nop.

I think the existing intrinsics are probably broken in the same way, but I think we need to come up with some solution here to
avoid generating broken code.

I think the easiest thing to do is to expose all offset fields, inst_offset, vgpr_offset, and soffset through the intrinsic, and then have the backend not try to optimize the offsets.

We can always go in later and add compiler hints to let the compiler know when it's safe to move the offsets.

dstuttard updated this revision to Diff 98632.May 11 2017, 8:02 AM

As suggested by Tom Stellard (due to potential issues with range checking) , I've changed the
intrinsics to have explicit operands for vindex, voffset, soffset and offset. The backend no longer
attempts to optimise by folding in any offsets it can spot in preceding instructions.

I've used separate vindex and voffset operands which means that the idxen and offen flags aren't
required in the intrinsic itself (and the compiler does the appropriate thing to fold the values
into a reg_sequence when both are used). To me this seems a little cleaner.

A later change (again suggested by Tom) might be to re-enable some of these optimisations with
appropriate compiler hints.

Any chance of another review of this change?
Does it look good to go now?
Thanks

arsenm edited edge metadata.May 23 2017, 10:24 AM

Needs some assembler tests for the new nfmt/dfmt parsing

include/llvm/IR/IntrinsicsAMDGPU.td
479

any_ty, so f32 is a valid type too

lib/Target/AMDGPU/SIISelLowering.cpp
3348–3351

This isn't the right place for this, although it is what other intrinsics are doing right now. As a follow up patch it would be good to move the MMO creation into getTgtMemIntrinsic

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.ll
5

Should use GCN check prefix (and space after ;)

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.store.ll
3

Ditto

Data can now be float or int
I decided that since the load variant would support this, so should the store. This meant re-jigging
the implementation slightly to use an approach more similar to load to enable the use of any for the
store as well (unless you can see a more efficient way to do it).

Noted about the MMO creation - I'll look into a later patch for this.

Made suggested changes to the tbuffer.load and tbuffer.store tests.
Added some more tests for the assembler etc. (See extra tests in MC/AMDGPU and
MC/Disassembler/AMDGPU

Forgot to update comment about float as an option for overloading

arsenm added inline comments.May 25 2017, 3:21 AM
lib/Target/AMDGPU/BUFInstructions.td
110

You shouldn't need to change anything about the encoding for the legacy intrinsics. They should purely be an input issue. I don'ts any repeated MTBUF encoding class, so this seems to be an unnecessary rename?

1465–1474

Again we don't need separate machine instruction or encoding definitions to support the old intrinsics

lib/Target/AMDGPU/SIISelLowering.cpp
3535

You shouldn't need a separate node here. You can insert whatever code is necessary to convert to the new intrinsic/node operands are here, such as bitcasting the types or inserting missing constants etc.

See responses to your individual comments - it could be that I've missed something in the way this could be done.

I was reluctant to make big changes to the original implementation (other than renaming) as:

  • I wanted to get the change in relatively quickly
  • I didn't want to risk breaking something in the legacy support that would mean errors creeping in to Mesa implementation for example

I think that the implementation could be simplified if the SIISelLowering code for the legacy intrinsic did the transformation to the new style at that point (as you suggest), but the issue with that is that the new implementation doesn't support the same fields as the old one (TFE, IDXEN, OFFEN). It could be that ignoring TFE is a valid thing to do and won't break anything, and it should be ok to ignore IDXEN and OFFEN, but is that definite?

lib/Target/AMDGPU/BUFInstructions.td
110

Unless I'm missing something - isn't the problem here that the pseudo instructions are using a different approach to representing the offset and idx enable (amongst others).
I was wary of changing the old implementation too much in order to preserve the old mechanism without any risk of breaking it.

lib/Target/AMDGPU/SIISelLowering.cpp
3535

Part of the issue is that the legacy intrinsic has TFE which isn't present in the new one. If TFE is never used it could be done - but it would effectively ignore this field (unless the intrinsic is changed to remove it too).

We could get this lowering code to check that the offen, idxen are correct given the parameters that have been passed in, and perhaps to make sure that tfe is never used (and assert if it is).

Not sure this is necessarily the approach we should take given that the only reason the legacy support is still there is to make sure that code using this approach doesn't break.

arsenm added inline comments.Jun 5 2017, 7:48 AM
lib/Target/AMDGPU/BUFInstructions.td
110

The pseudo instructions here are not related to the operands at all. We define pseudo and real encoding classes for each instruction encoding type because in VI, the instruction encoding changed. The pseudo instructions allow codegen to ignore this detail, and then the MC emission can select the right real encoding opcode at the last possible moment. It is entirely disconnected from the intrinsics.

MUBUF instructions have all the combinations of idxen and offen enabled, this again isn't related to the input intrinsics. There are actually this many different addressing mode options that need to be represented with different instructions, this isn't changing.

dstuttard added inline comments.Jun 6 2017, 8:58 AM
lib/Target/AMDGPU/BUFInstructions.td
110

I misunderstood your original comment. I took it to mean that the legacy and new implementation could be unified at this point.

This is the original implementation of the legacy intrinsic (with the name changed to indicate it is the old implementation), I didn't think there was much point in tidying up the implementation. However, if you think it is worthwhile I'll make the necessary changes (I'm inclined to agree that the legacy code should be shortened if there is an easy option to do so).

dstuttard updated this revision to Diff 102323.Jun 13 2017, 5:59 AM

Removed the bulk of the legacy implementation and now lower to the new form earlier

Some issues to note:

  1. tfe can't be used in the legacy intrinsic (asserts if this is detected). The old implementation

wouldn't have worked if tfe was enabled anyway so this isn't a loss of functionality

  1. idxen and offen can't both be used - this is supported by the new intrinsic, but the legacy

implementation wouldn't have worked in this mode anyway. Rather than re-write the legacy intrinsic
to support 1 or 2 dword VAddr operands (which is what is required) I've just added an assert to
catch this case. Again - this isn't a loss of functionality over the old intrinsic anyway.

  1. The legacy intrinsic supports 3-vec form (TBUFFER_STORE_FORMAT_XYZ) which I've also added in this

updated implementation for the legacy intrinsic. The new intrinsic doesn't support it at the moment
and will have to be added at some point.

LGTM besides formatting issues

lib/Target/AMDGPU/SIISelLowering.cpp
3524

80 column limit (a few other places too)

3530

Variable name styles

dstuttard updated this revision to Diff 103325.Jun 21 2017, 1:50 AM

Corrected 80 column formatting and variable names

dstuttard updated this revision to Diff 103326.Jun 21 2017, 1:52 AM

Inadvertently included extra changes in last diff

arsenm accepted this revision.Jun 21 2017, 3:12 PM

LGTM

This revision is now accepted and ready to land.Jun 21 2017, 3:12 PM
This revision was automatically updated to reflect the committed changes.
test/Transforms/SLPVectorizer/AMDGPU/packed-math.ll