This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] intrinsics for byte/short load/store
ClosedPublic

Authored by rtaylor on Feb 3 2018, 10:35 AM.

Details

Reviewers
mareko
arsenm
nhaehnle
timcorringham
Group Reviewers
Restricted Project
Summary

Added intrinsics for the instructions:

  • buffer_load_ubyte
  • buffer_load_ushort
  • buffer_store_byte
  • buffer_store_short

Added test cases to existing buffer load/store tests.
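
For reference, a minimal sketch of how the new overloads are meant to be used from IR. The raw-buffer names and the (rsrc, voffset, soffset, cachepolicy) argument list are assumptions based on the other llvm.amdgcn.raw.buffer.* intrinsics, not text from this review:

declare i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.i8(i8, <4 x i32>, i32, i32, i32)

; Load one byte (selects buffer_load_ubyte, zero-extended in the destination
; VGPR) and store it back (buffer_store_byte).
define void @byte_roundtrip(<4 x i32> %rsrc, i32 %voffset) {
  %b = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 %voffset, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %b, <4 x i32> %rsrc, i32 %voffset, i32 0, i32 0)
  ret void
}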

Diff Detail

Event Timeline

timcorringham created this revision. Feb 3 2018, 10:35 AM
timcorringham added a reviewer: Restricted Project. Feb 3 2018, 10:44 AM
arsenm requested changes to this revision. Feb 3 2018, 11:01 AM

I don't think we need intrinsics for these. At most we should add a mangled type to the existing buffer intrinsics.

include/llvm/IR/IntrinsicsAMDGPU.td
822

float return type doesn't make sense

This revision now requires changes to proceed. Feb 3 2018, 11:01 AM
tpr added a comment. Feb 5 2018, 1:03 AM

Matt, the instructions zero-extend the data to i32, so the return types of the int, ushort and ubyte variants are the same, and overloading would not work.

Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is of course my motivation for implementing them).

As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also I think changing it would require a disproportionate amount of effort.

mareko added a comment. Feb 5 2018, 7:18 AM

It should be easy to change the return type to i32.

mareko added a comment. Feb 5 2018, 7:18 AM

and, of course, the vdata type.

arsenm added a comment. Feb 5 2018, 8:04 AM

Couldn't we also optimize the loads at least based on used bits like a normal load?

> Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is of course my motivation for implementing them).
>
> As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also I think changing it would require a disproportionate amount of effort.

Yes, they do, but the intrinsic doesn't need to care.

tpr added a comment. Feb 6 2018, 2:19 PM

If we define an overloaded intrinsic with a return type of i8, and the IR using it wants the value zero-extended to i32, the frontend would then have to emit a separate zext. I guess we could optimize that to the zero-extending instruction in instruction selection, but wouldn't it be better to have the intrinsic match what the ISA instruction does by returning the zero-extended i32?
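
To make the extra zext concrete, here is a sketch of the i8-returning shape tpr describes; the overload name and the legacy (rsrc, vindex, offset, glc, slc) argument list are assumed for illustration:

declare i8 @llvm.amdgcn.buffer.load.i8(<4 x i32>, i32, i32, i1, i1)

define i32 @ubyte_as_i32(<4 x i32> %rsrc) {
  %b = call i8 @llvm.amdgcn.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i1 false, i1 false)
  ; the separate zero extend the frontend would have to emit:
  %v = zext i8 %b to i32
  ret i32 %v
}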

Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

> Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

Well we can, but using the loads without format conversion can be faster. See https://gpuopen.com/gcn-memory-coalescing/:

  • 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
  • 32-bit (or smaller) filtered texels / wave = 16 clocks
rtaylor updated this revision to Diff 187824. Feb 21 2019, 10:53 AM
rtaylor added a subscriber: rtaylor.

Updating to include requested changes

Herald added a project: Restricted Project. Feb 21 2019, 10:53 AM
arsenm added inline comments. Feb 21 2019, 5:15 PM
lib/Target/AMDGPU/SIISelLowering.cpp
5611–5619

You shouldn't be inspecting the users. You can just unconditionally use one or the other. You're going to have to insert a truncate back to the original type at the end anyway. You can then add a separate optimization to fold the sext_inreg or mask into the buffer load, as is done for ordinary loads.

rtaylor added inline comments. Feb 22 2019, 7:30 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5611–5619

There are four potential options, so what do you mean by one or the other? There is BUFFER_LOAD_ubyte/ushort/short/byte for the Opc.

arsenm added inline comments. Feb 22 2019, 7:49 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5611–5619

You can just unconditionally use load_ubyte/load_ushort. Folding the sign extend in is then a separate optimization on a sext (or more likely a sext_inreg)
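
In other words (a sketch; the i8 overload and its arguments are assumed, as above), IR like this would first select the zero-extending load, and a later DAG combine could fold the sign extension into buffer_load_sbyte:

declare i8 @llvm.amdgcn.buffer.load.i8(<4 x i32>, i32, i32, i1, i1)

define i32 @sbyte_as_i32(<4 x i32> %rsrc) {
  %b = call i8 @llvm.amdgcn.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i1 false, i1 false)
  ; legalizes to a sext_inreg of the i32 load result; the combine can then
  ; replace ubyte load + sext_inreg with the signed load
  %v = sext i8 %b to i32
  ret i32 %v
}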

rtaylor added inline comments. Feb 22 2019, 8:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5611–5619

So outputting byte/short based on sign_extend in the TableGen pattern? This won't allow re-use of the existing multiclass without changes.

I think I remember there being a reason Nicolai and I decided not to do this.

arsenm added inline comments. Feb 26 2019, 7:56 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5611–5619

This doesn't change the selection. This is an optimization done in the DAGCombiner

rtaylor updated this revision to Diff 188597. Feb 27 2019, 12:09 PM

Request changes

rtaylor updated this revision to Diff 188598. Feb 27 2019, 12:17 PM

Rename function to better reflect what it does

arsenm added inline comments. Mar 5 2019, 8:08 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5605–5609

Ternary operator

5635

Capitalize

5636–5637

This looks identical to the other part, which is kind of surprising to me, but it should be factored into something common.

5639

This comment can be removed

5670

Repeated again

7782

Formatting

7785

Should have a hasOneUse check

7786

Leftover debugging

7787

This is missing a check on the source type. If you want to be fancier, you can split out the remainder bits into a new sign extend, but there probably isn't much reason to.

7792–7794

"will be set by" part doesn't make sense here

test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
275

The base test case shouldn't have an extend of the use; it should use the value directly. You should also have one with an explicit zext.
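
For example (sketched against the raw-buffer overload this test file exercises; the signature is an assumption, as above):

declare i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32>, i32, i32, i32)

; base case: no extend, return the loaded value directly
define i8 @raw_buffer_load_i8(<4 x i32> %rsrc) {
  %v = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  ret i8 %v
}

; variant with an explicit zext
define i32 @raw_buffer_load_i8_zext(<4 x i32> %rsrc) {
  %v = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %z = zext i8 %v to i32
  ret i32 %z
}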

307

Could use a testcase with a second non-extended use

rtaylor added inline comments. Mar 5 2019, 8:44 AM
lib/Target/AMDGPU/SIISelLowering.cpp
5636–5637

Do you mean that there is similar/same code in the different cases of buffer_load (i.e. struct, raw, normal), and that these should be factored into something common (i.e. a function call)?

7785

You mean there should be a hasOneUse check on the SIGN_EXTEND_INREG, right?

7787

Src is the BUFFER_LOAD_XXX. The only way this code is executed is if the Src is a BUFFER_LOAD_XXX. I'm not sure we need a redundant check here, do we?

arsenm added inline comments. Mar 5 2019, 9:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7785

No, the buffer operation. If there are multiple uses you will end up creating multiple loads

7787

The number of bits in the sext_inreg may not match the load's 8/16-bit source width. You can test this with something like

%load = call i8 @llvm.amdgcn.buffer.load.i8(...)
%ext = zext i8 %load to i32
%shl = shl i32 %ext, 27
%shr = ashr i32 %shl, 27

More shifts will be needed to clear the extra bits in the loaded value.

rtaylor added inline comments. Mar 11 2019, 10:05 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7787

This should produce a buffer_load_sbyte, right? That is what it does currently.

arsenm added inline comments. Mar 11 2019, 1:50 PM
lib/Target/AMDGPU/SIISelLowering.cpp
7787

But it needs additional shifts even after that. Right now you're not clearing the extra bits in the low 8 that need to be cleared.

arsenm added inline comments. Mar 11 2019, 1:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
7787

You can either leave it as
x = buffer_load_ubyte
sext_inreg x, i5

or
x = buffer_load_sbyte
sext_inreg x, i5

I'm not sure there's much practical difference between them

rtaylor added inline comments. Mar 11 2019, 1:57 PM
lib/Target/AMDGPU/SIISelLowering.cpp
7787

Right, I don't think there is; I'm working on doing the former. Thanks.

arsenm added inline comments. Mar 11 2019, 2:04 PM
lib/Target/AMDGPU/SIISelLowering.cpp
7787

There might be a small advantage to canonicalizing the sext_inreg sizes. The constant masks that will be emitted for BFI might be more likely to be reusable.

rtaylor commandeered this revision. Mar 12 2019, 9:50 AM
rtaylor added a reviewer: timcorringham.

Changing ownership

rtaylor updated this revision to Diff 190286. Mar 12 2019, 9:50 AM

Add requested changes

Mostly LGTM except some nits. The major one is avoiding the repeated lowering code for each of these cases

lib/Target/AMDGPU/SIISelLowering.cpp
5601

Capitalize

5608

You can just hardcode this to MVT::Other

5635

Capitalize

5636–5637

Yes

5640–5643

Ternary operator

6283–6286

Ternary operator

6309

Capitalize

6315–6318

Ternary operator

7789

Extra space before ==

7791

Extra space before ==

rtaylor updated this revision to Diff 190349. Mar 12 2019, 3:29 PM

Requested Changes

arsenm accepted this revision. Mar 18 2019, 7:37 AM

LGTM except formatting

lib/Target/AMDGPU/SIISelLowering.cpp
6421

Brace placement

6440

Brace placement

This revision is now accepted and ready to land. Mar 18 2019, 7:37 AM
rtaylor closed this revision. Mar 20 2019, 7:11 AM