This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] intrinsics for byte/short load/store
ClosedPublic

Authored by rtaylor on Feb 3 2018, 10:35 AM.

Details

Reviewers
mareko
arsenm
nhaehnle
timcorringham
Group Reviewers
Restricted Project
Summary

Added intrinsics for the instructions:

  • buffer_load_ubyte
  • buffer_load_ushort
  • buffer_store_byte
  • buffer_store_short

Added test cases to existing buffer load/store tests.
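
For reference, a minimal IR sketch of how these operations end up being exercised, assuming the type-mangled raw buffer overloads that the review below converges on (function names here are illustrative):

declare i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.i8(i8, <4 x i32>, i32, i32, i32)

; the zext is expected to fold into a single buffer_load_ubyte
define amdgpu_ps float @load_ubyte(<4 x i32> inreg %rsrc) {
  %b = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = zext i8 %b to i32
  %f = uitofp i32 %ext to float
  ret float %f
}

; the truncating store selects to buffer_store_byte
define amdgpu_ps void @store_byte(<4 x i32> inreg %rsrc, i32 %v) {
  %b = trunc i32 %v to i8
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %b, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  ret void
}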

Diff Detail

Event Timeline

timcorringham created this revision. Feb 3 2018, 10:35 AM
timcorringham added a reviewer: Restricted Project. Feb 3 2018, 10:44 AM
arsenm requested changes to this revision. Feb 3 2018, 11:01 AM

I don't think we need intrinsics for these. At most we should add a mangled type to the existing buffer intrinsics.

include/llvm/IR/IntrinsicsAMDGPU.td
494

float return type doesn't make sense

This revision now requires changes to proceed. Feb 3 2018, 11:01 AM
tpr added a comment. Feb 5 2018, 1:03 AM

Matt, the instructions zero-extend the data to i32, so the return types of the int, ushort, and ubyte variants are the same, and overloading would not work.

Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is, of course, my motivation for implementing them).

As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.
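
To illustrate the overloading problem (a sketch, assuming the legacy buffer.load signature): if every variant returned the zero-extended i32, the overload mangling would give the dword, ubyte, and ushort loads the same name, and the element size would be unrecoverable:

; dword, ubyte, and ushort loads would all have to share this declaration:
declare i32 @llvm.amdgcn.buffer.load.i32(<4 x i32>, i32, i32, i1, i1)
; nothing in the overloaded type says which instruction is meant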

mareko added a comment. Feb 5 2018, 7:18 AM

It should be easy to change the return type to i32.

mareko added a comment. Feb 5 2018, 7:18 AM

And, of course, the vdata type.

arsenm added a comment. Feb 5 2018, 8:04 AM

Couldn't we also optimize the loads at least based on used bits like a normal load?

> Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is, of course, my motivation for implementing them).

> As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.

Yes, they do, but the intrinsic doesn't need to care.

tpr added a comment. Feb 6 2018, 2:19 PM

If we define an overloaded intrinsic with a return type of i8, and the IR using it wants the value zero-extended to i32, the frontend would then have to emit a separate zext. I guess we could optimize that to the zero-extending instruction during instruction selection, but wouldn't it be better to have the intrinsic match what the ISA instruction does by returning the zero-extended i32?
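
Concretely, a sketch assuming the type-mangled i8 overload:

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%val = zext i8 %load to i32   ; separate zext that selection must fold back into buffer_load_ubyte

whereas an i32-returning intrinsic would already match what the ISA instruction produces.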

Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

> Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

Well, we can, but using the loads without conversion can be faster. See https://gpuopen.com/gcn-memory-coalescing/:

  • 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
  • 32-bit (or smaller) filtered texels / wave = 16 clocks
rtaylor updated this revision to Diff 187824. Feb 21 2019, 10:53 AM
rtaylor added a subscriber: rtaylor.

Updating to include requested changes

Herald added a project: Restricted Project. Feb 21 2019, 10:53 AM
arsenm added inline comments. Feb 21 2019, 5:15 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

You shouldn't be inspecting the users. You can just unconditionally use one or the other. You're going to have to insert a truncate back to the original type at the end anyway. You can then add a separate optimization to fold the sext_inreg or mask into the buffer operation, as is done for normal loads.
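
A sketch of the suggested two-step approach (the machine opcode names in the comments are illustrative):

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%sext = sext i8 %load to i32
; step 1: lowering unconditionally emits the unsigned form plus a truncate:
;   t0 = BUFFER_LOAD_UBYTE ...   ; i32 result, zero-extended by the hardware
;   t1 = truncate t0 to i8
; step 2: the sign-extending use legalizes to a sext_inreg, and a separate
; DAG combine folds it into the load:
;   (sext_inreg (BUFFER_LOAD_UBYTE ...), i8) -> (BUFFER_LOAD_SBYTE ...)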

rtaylor added inline comments. Feb 22 2019, 7:30 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

There are four potential options, so what do you mean by one or the other? There is BUFFER_LOAD_ubyte/ushort/short/byte for the Opc.

arsenm added inline comments. Feb 22 2019, 7:49 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

You can just unconditionally use load_ubyte/load_ushort. Folding the sign extend in is then a separate optimization on a sext (or more likely a sext_inreg)

rtaylor added inline comments. Feb 22 2019, 8:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

So outputting byte/short based on sign_extend in the TableGen pattern? This won't allow re-use of the existing multiclass without changes.

I think I remember there being a reason Nicolai and I decided not to do this.

arsenm added inline comments. Feb 26 2019, 7:56 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

This doesn't change the selection. This is an optimization done in the DAGCombiner

rtaylor updated this revision to Diff 188597. Feb 27 2019, 12:09 PM

Requested changes

rtaylor updated this revision to Diff 188598. Feb 27 2019, 12:17 PM

Rename function to better reflect what it does

arsenm added inline comments. Mar 5 2019, 8:08 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4935

Capitalize

4936–4937

This looks identical to the other part, which is kind of surprising to me, but it should be factored into something common

4937

Repeated again

4938–4942

Ternary operator

4939

This comment can be removed

6349

Formatting

6352

Should have a hasOneUse check

6353

Leftover debugging

6354

This is missing a check on the source type. If you want to be fancier, you can split out the remainder bits into a new sign extend, but there probably isn't much reason to.

6359–6361

"will be set by" part doesn't make sense here

test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
275 (On Diff #188598)

The base test case shouldn't have an extend of the use; it should use the value directly. You should also have one with an explicit zext.
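
For instance, a sketch of the two shapes being asked for (function names are illustrative; CHECK lines omitted):

; base case: the loaded i8 is used directly, with no extension
define amdgpu_ps void @raw_buffer_load_i8(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %val, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  ret void
}

; explicit zext: should still select a single buffer_load_ubyte
define amdgpu_ps float @raw_buffer_load_i8_zext(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = zext i8 %val to i32
  %f = uitofp i32 %ext to float
  ret float %f
}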

307 (On Diff #188598)

Could use a testcase with a second non-extended use
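
Something along these lines (illustrative):

; %val has a second, non-extended use; the combine should not duplicate the
; load, so expect one buffer_load_ubyte plus a separate sign-extension
define amdgpu_ps float @raw_buffer_load_i8_sext_multi_use(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %val, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = sext i8 %val to i32
  %f = sitofp i32 %ext to float
  ret float %f
}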

rtaylor added inline comments. Mar 5 2019, 8:44 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4936–4937

Do you mean that there is similar/identical code in the different cases of buffer_load (i.e. struct, raw, normal), and that these should be factored into something common (i.e. a function call)?

6352

You mean there should be a hasOneUse check on the SIGN_EXTEND_INREG, right?

6354

Src is the BUFFER_LOAD_XXX. The only way this code is executed is if the Src is a BUFFER_LOAD_XXX. I'm not sure we need a redundant check here, do we?

arsenm added inline comments. Mar 5 2019, 9:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6352

No, the buffer operation. If there are multiple uses, you will end up creating multiple loads.
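
A sketch of the hazard (illustrative fragment):

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%sext = sext i8 %load to i32   ; the combine wants to rewrite this into buffer_load_sbyte
%zext = zext i8 %load to i32   ; but this use still needs the zero-extending load
; without a hasOneUse() check on the buffer operation, the rewrite leaves
; each use with its own load, duplicating the memory access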

6354

The number of bits in the sext_inreg may not match the load's from-8/16-bit source. You can test this with something like:
%load = call i8 @llvm.amdgcn.buffer.load.i8(...)
%ext = zext i8 %load to i32
%shl = shl i32 %ext, 27
%shr = ashr i32 %shl, 27

More shifts will be needed to clear the extra bits in the loaded value.

rtaylor added inline comments. Mar 11 2019, 10:05 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

This should produce a buffer_load_sbyte, right? That is what it does currently.

arsenm added inline comments. Mar 11 2019, 1:50 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

But it needs additional shifts even after that. Right now you're not clearing the extra bits in the low 8 that need to be cleared.

arsenm added inline comments. Mar 11 2019, 1:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

You can either leave it as
x = buffer_load_ubyte
sext_inreg x, i27

or
x = buffer_load_sbyte
sext_inreg x, i5

I'm not sure there's much practical difference between them

rtaylor added inline comments. Mar 11 2019, 1:57 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

Right, I don't think there is; I'm working on doing the former. Thanks.

arsenm added inline comments. Mar 11 2019, 2:04 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

There might be a small advantage to canonicalizing the sext_inreg sizes. The constant masks that will be emitted for BFI might be more likely to be reusable.

rtaylor commandeered this revision. Mar 12 2019, 9:50 AM
rtaylor added a reviewer: timcorringham.

Changing ownership

rtaylor updated this revision to Diff 190286. Mar 12 2019, 9:50 AM

Add requested changes

Mostly LGTM except for some nits. The major one is avoiding the repeated lowering code for each of these cases.

lib/Target/AMDGPU/SIISelLowering.cpp
4934

Capitalize

4936

Capitalize

4936–4937

Yes

4941

You can just hardcode this to MVT::Other

4941–4944

Ternary operator

5327–5330

Ternary operator

5341

Capitalize

5347–5350

Ternary operator

6356

Extra space before ==

6358

Extra space before ==

rtaylor updated this revision to Diff 190349. Mar 12 2019, 3:29 PM

Requested Changes

arsenm accepted this revision. Mar 18 2019, 7:37 AM

LGTM except formatting

lib/Target/AMDGPU/SIISelLowering.cpp
5349

Brace placement

5368

Brace placement

This revision is now accepted and ready to land. Mar 18 2019, 7:37 AM
rtaylor closed this revision. Mar 20 2019, 7:11 AM