AMDGPU/SI: Implement d16 support for buffer intrinsics
ClosedPublic

Authored by cfang on Oct 13 2017, 4:06 PM.

Details

Summary

This patch implements buffer_load_format and tbuffer_load_format intrinsics that support half data types.

While types that are not legal currently ( v4f16, for example), we are using ReplaceNodeResults to change the
type and cast it back after customer lowering.

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
cfang added a comment.Nov 3 2017, 3:38 PM
  1. Define a new feature, HasPackedD16VMem for gfx8.1 and beyond to guard the generation of VMem instructions with D16 bit set.
  2. Put buffer_store intrinsics implementation in the same patch as buffer_loads;
arsenm added inline comments.Nov 6 2017, 2:14 AM
lib/Target/AMDGPU/AMDGPU.td
289 ↗(On Diff #121560)

I think we should invert this, to HasUnpackedD16Mem. It's only the one weird target, the packed layout is the expected one and for every other subtarget.

lib/Target/AMDGPU/BUFInstructions.td
675 ↗(On Diff #121560)

Comment belongs somewhere else. We usually have the true opcode name as all caps, and modifiers like this as lowercase. In this case we should maybe call it _gfx81 instead.

lib/Target/AMDGPU/SIISelLowering.cpp
549 ↗(On Diff #121560)

Since you really want to get the i16 equivalent vector, you can use changeTypeToInteger on the f16 type.

589 ↗(On Diff #121560)

I think you can assume the ABI type alignment for these (at least for the scalar type) not that it probably matters

3268 ↗(On Diff #121560)

s/ChangeResultType/lowerIntrinsicWChain/

3274 ↗(On Diff #121560)

switch over IID

3345 ↗(On Diff #121560)

I'm pretty sure there should be no f32 or FP_ROUND here. This is returning the half format result just expanded into multiple 32-bit registers. They are still half, you need to truncate the int type and bitcast to f16.

4548–4549 ↗(On Diff #121560)

Same thing as with stores

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.d16.ll
1 ↗(On Diff #121560)

These should have a common GCN check prefix to avoid the duplicated -LABEL lines

5 ↗(On Diff #121560)

Spaces after ;

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.d16.ll
34 ↗(On Diff #121560)

The waitcnts aren't interesting to check. it would be more useful to extract and store the components to see what instructions are required.

cfang added inline comments.Nov 6 2017, 9:54 AM
lib/Target/AMDGPU/SIISelLowering.cpp
549 ↗(On Diff #121560)

For v2f16, I want something that uses 2 registers, so v2i16 doesn't work for me.
What should be the type that specifies the two registers that hold the two half values? Here I am using v2f32.

3345 ↗(On Diff #121560)

Do you mean the returned type is not v2f32? so what is the type that returned? v2f16? I just could not understand how to represent the two half values in two registers!

I hope you can be more specific on what to do here.

cfang marked 11 inline comments as done.Nov 7 2017, 11:36 AM
cfang added inline comments.
lib/Target/AMDGPU/AMDGPU.td
289 ↗(On Diff #121560)

OK. Use HasUnpackedD16VMem.

lib/Target/AMDGPU/BUFInstructions.td
675 ↗(On Diff #121560)

Done. Use _gfx80 because only gfx80 has the feature "HasUnpackedD16VMem".

lib/Target/AMDGPU/SIISelLowering.cpp
549 ↗(On Diff #121560)

Right. So Remove this function definition.

589 ↗(On Diff #121560)

Not sure how to get the ABI type alignment here since we are using nullptr as ptrval.

3268 ↗(On Diff #121560)

Changed func name to lowerIntrinsicWChain.

3274 ↗(On Diff #121560)

OK switch.

3345 ↗(On Diff #121560)

For v2f16, changed the intrinsic load type to v2i32 if the target has unpacked vmem instructions. After the load,
we truncate the data to v2i16, and then bitcast it back to v2f16. v4f16 case is done similarly.

4548–4549 ↗(On Diff #121560)

For store of v2f16, we first bitcast the data to v2i16, and then zero_extend it to v2i32 to store. v4f16 case is done similarly.

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.d16.ll
5 ↗(On Diff #121560)

Thanks.

cfang updated this revision to Diff 121946.Nov 7 2017, 11:37 AM
cfang marked 7 inline comments as done.

Update based on Matt's comments.

Pardon my ignorance, but why isn't include/llvm/IR/IntrinsicsAMDGPU.td being updated?

Pardon my ignorance, but why isn't include/llvm/IR/IntrinsicsAMDGPU.td being updated?

The intrinsics use overloading based on the type, so no new intrinsics need to be defined.

lib/Target/AMDGPU/BUFInstructions.td
1464–1475 ↗(On Diff #121946)

Are the pattens with SItbuffer_load and 16-bit types ever used? I believe the f16 maybe, but isn't the v2f16 getting replaced by ReplaceNodeResults?

I should add that apart from the one comment I had, the change looks good to me. It would be great if we could avoid those added custom SD nodes somehow, though...

cfang added a comment.Nov 22 2017, 1:30 PM
lib/Target/AMDGPU/BUFInstructions.td
1464–1475 ↗(On Diff #121946)

ReplaceNodeResults will only replace illegal vector types (v2f16 on gfx8 and v4f16 on gfx8+).
So the pattens with SItbuffer_load and v2f16 are used for gfx9+ (and f16 are used for gfx8+).

cfang updated this revision to Diff 127023.Dec 14 2017, 2:09 PM
  1. Merge the patch with LLVM trunk.
  2. Update LIT tests to avoid specific registers.
cfang updated this revision to Diff 127152.Dec 15 2017, 9:33 AM

Merge with the latest LLVM trunk.

We need to keep the ball rolling! Please advice what else to do in order to move ahead. Thanks.

arsenm added inline comments.Dec 15 2017, 9:55 AM
lib/Target/AMDGPU/SIISelLowering.cpp
556–561 ↗(On Diff #127152)

You don't need this, you can just do VT.getScalarType()

3333 ↗(On Diff #127152)

I'm confused because all of the intrinsic logic isn't contained here, and this also just always uses the d16 version which is wrong

3412 ↗(On Diff #127152)

The only use of this is the negated condition, so just don't negate it

4390–4393 ↗(On Diff #127152)

This is an unrelated change

4711 ↗(On Diff #127152)

This is redundant with the other type check, you really want isVector() and to just assert on v3f16

4720 ↗(On Diff #127152)

Missing space before //

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.format.d16.ll
1 ↗(On Diff #127152)

Use -enable-var-scope

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.d16.ll
1 ↗(On Diff #127152)

-enable-var-scope

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.store.d16.ll
1 ↗(On Diff #127152)

-enable-var-scope

cfang marked 6 inline comments as done.Dec 15 2017, 12:17 PM
cfang added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
3333 ↗(On Diff #127152)

SDValue Res = lowerIntrinsicWChain(Op, EquivLoadVT, DAG)

This is just a help function to replace the result type of "Op" with " EquivLoadVT". maybe we should use a different name to avoid confusion.

What about: ReplaceResultType ? Thanks.

4390–4393 ↗(On Diff #127152)

Why do you think this is unrelated? I have to handle Intrinsic::amdgcn_buffer_load_format in getTgtMemIntrinsic.
After return "true" there, I think it is better to make appropriate change here also.

4711 ↗(On Diff #127152)

Do you think we need to assert for v3f16? The code in this block has no reason not to handle v3f16. So the assert of v3f16, if any, should be somewhere else. Also MVY::v3f16 is not recognized in LC.

cfang updated this revision to Diff 127181.Dec 15 2017, 1:03 PM

Update based on Matt's recent comments. Thanks.

Can you add some assembler tests for this, particularly the 3x components ones

lib/Target/AMDGPU/BUFInstructions.td
925 ↗(On Diff #127181)

You an use VReg_96

1873 ↗(On Diff #127181)

You should be able to define the 3x encoded ones

1881 ↗(On Diff #127181)

You should be able to define the 3x encoded ones

lib/Target/AMDGPU/SIISelLowering.cpp
581–601 ↗(On Diff #127181)

This part should be superseded by D41470

3398 ↗(On Diff #127181)

This should be in a separate function

4747 ↗(On Diff #127181)

Dead code. You need to check the number of elements since there is no v3f16 yet

arsenm added inline comments.Fri, Dec 22, 12:05 PM
lib/Target/AMDGPU/SIISelLowering.cpp
581–601 ↗(On Diff #127181)

Also it's unrelated because this could be done separately for the existing intrinsics, as is done in D41470

3333 ↗(On Diff #127152)

The point of this is to move all of the intrinsic lowering logic to one function. The type change wrapper may be a useful function used by the function called from the switch. Part of the point is to reduce the amount of code nested in switches

cfang marked 11 inline comments as done.Thu, Jan 4, 3:12 PM
cfang updated this revision to Diff 128660.Thu, Jan 4, 3:19 PM

Update based on Reviewer's comments:

  1. define encoding subtarget "GFX80" to encode unpacked d16 buffer/tbuffer instructions;
  2. add MC test cases;
  3. define two functions: SITargetLowering::lowerIntrinsicWChain and SITargetLowering::handleVDataToStore.
arsenm added inline comments.Tue, Jan 9, 9:00 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3552–3553 ↗(On Diff #128660)

This is still structured strangely. Most of the code inside this function should be in a helper, and lowerIntrinsicWChain should have the switch over the intrinsic

3565 ↗(On Diff #128660)

This is not the correct chain, this is the input chain, not the output result chain

3572–3573 ↗(On Diff #128660)

No return after else

3606 ↗(On Diff #128660)

Remove = SDValue()

lib/Target/AMDGPU/SIISelLowering.h
63 ↗(On Diff #128660)

Typo brirf, but autobrief is on so you can just remove it

64 ↗(On Diff #128660)

Use reference for chain out argument, or just pass in the SmallVectorImpl& directly

lib/Target/AMDGPU/SIInstrInfo.td
1899 ↗(On Diff #128660)

You should add a note explaining why this is here and that it should probably be removed at some point

cfang added inline comments.Tue, Jan 9, 9:45 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3565 ↗(On Diff #128660)

I though "Res" is the output here, except that it has to be type-converted to the final result.

So how to get the "output result chain" you mentioned? Thanks.

cfang marked 4 inline comments as done.Tue, Jan 9, 11:41 AM
cfang updated this revision to Diff 129134.Tue, Jan 9, 11:43 AM

Update based on Matt's latest comments.

Please advice how to get the chain if you still don't agree what I got is actually the result chain. Thanks.

cfang updated this revision to Diff 129312.Wed, Jan 10, 11:23 AM

Get the output "Chain" of the intrinsic correctly. Thanks.

arsenm added inline comments.Wed, Jan 10, 2:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
3508–3509 ↗(On Diff #129312)

We already have LowerINTRINSIC_W_CHAIN, so this name could be confusing

3564 ↗(On Diff #129312)

Move this into a helper function called in each switch rather than putting code after the switch. Not every intrinsic with an illegal type will necessarily be this situation

3569 ↗(On Diff #129312)

No return after else

cfang marked an inline comment as done.Wed, Jan 10, 3:24 PM
cfang added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
3508–3509 ↗(On Diff #129312)

This is actually the name you suggested before.

Do you think lowerIntrinsicWChainWithIllegalReturnType is OK? Or any ither name suggestion? Thanks.

3564 ↗(On Diff #129312)

OK, will do as you suggested. Thanks.

3569 ↗(On Diff #129312)

What do you mean here "no return after else"? Do you mean there should not be a return statement, or a return statement is missing?
Thanks.

cfang updated this revision to Diff 129366.Wed, Jan 10, 4:09 PM
cfang marked an inline comment as done.

Rename a function and add a new helper function based on Matt's comments.

arsenm added inline comments.Thu, Jan 11, 9:56 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4915 ↗(On Diff #129366)

You don't need a pointer out argument here. You can just have handleVDataToStore return SDValue() if it doesn't need to do anything. If it actaully does something you know it is d16

cfang updated this revision to Diff 129490.Thu, Jan 11, 11:24 AM

Redesign the function to handle Vdata to store so that is will be called only when vdata is of type d16.

lib/Target/AMDGPU/SIISelLowering.cpp
4915 ↗(On Diff #129366)

I redesigned this function. It will called called only when VData is the type of "D16".

What do you think?

arsenm accepted this revision.Fri, Jan 12, 11:35 AM

LGTM

This revision is now accepted and ready to land.Fri, Jan 12, 11:35 AM
This revision was automatically updated to reflect the committed changes.