This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Implement d16 support for buffer intrinsics
ClosedPublic

Authored by cfang on Oct 13 2017, 4:06 PM.

Details

Summary

This patch implements buffer_load_format and tbuffer_load_format intrinsics that support half data types.

While types that are not legal currently ( v4f16, for example), we are using ReplaceNodeResults to change the
type and cast it back after customer lowering.

Diff Detail

Event Timeline

cfang created this revision.Oct 13 2017, 4:06 PM
arsenm added inline comments.Oct 16 2017, 2:18 PM
lib/Target/AMDGPU/SIISelLowering.cpp
204

This also applies to VI, but it doesn't really matter if it is legal or not. you can still set it to custom

3230

Should put this into a new function

3232–3234

This should not be an assert, but should also probably handle v3f16

3258–3261

This should be moved into getTgtMemIntrinsic so the MMO already exists

3277–3280

Ditto

3284

Return on different line

3286

Typo yo, pu

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.d16.ll
1 ↗(On Diff #118982)

GCN as check prefix. Also should have a VI check line? Is this the one where the output register layout can differ on some gfx8 variant?

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.d16.ll
15–16

Should check operands

cfang added inline comments.Oct 17 2017, 3:40 PM
lib/Target/AMDGPU/SIISelLowering.cpp
3258–3261

I could not understand your intention here. Can you be more specific how to move this to "bool SITargetLowering::getTgtMemIntrinsic"?
And how to get MMO? Thanks.

arsenm added inline comments.Oct 23 2017, 9:44 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3258–3261

If you implement getTgtMemIntrinsic it will add the same MMO to the intrinsic. You then just need to forward it to the new node. This is still suboptimal since we should really be using the buffer PseudoSourceValue instead of a null MachinePointerInfo

cfang updated this revision to Diff 120119.Oct 24 2017, 1:00 PM

Update based on Matt's review comments.

cfang added inline comments.Oct 24 2017, 1:05 PM
lib/Target/AMDGPU/SIISelLowering.cpp
204

OK

3232–3234

v3f16 as TODO.

3258–3261

Implemented getTgtMemIntrinsic, and made appropriate changes. Thanks.

3286

Thanks.

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.d16.ll
1 ↗(On Diff #118982)

I am not sure. But the ISA for the instructions are exactly the same for both VI and gfx9. So I didn't add VI check.

cfang updated this revision to Diff 120631.Oct 27 2017, 10:06 AM

Implement getTgtMemIntrinsic for buffer_load/tbuffer_load intrinsics.
Update based on Matt's review comments.

cfang updated this revision to Diff 121560.Nov 3 2017, 3:38 PM
cfang retitled this revision from AMDGPU/SI: Implement d16 support buffer_load_format and tbuffer_load_format intrinsics to AMDGPU/SI: Implement d16 support for buffer intrinsics.
cfang added a comment.Nov 3 2017, 3:38 PM
  1. Define a new feature, HasPackedD16VMem for gfx8.1 and beyond to guard the generation of VMem instructions with D16 bit set.
  2. Put buffer_store intrinsics implementation in the same patch as buffer_loads;
arsenm added inline comments.Nov 6 2017, 2:14 AM
lib/Target/AMDGPU/AMDGPU.td
289

I think we should invert this, to HasUnpackedD16Mem. It's only the one weird target, the packed layout is the expected one and for every other subtarget.

lib/Target/AMDGPU/BUFInstructions.td
675

Comment belongs somewhere else. We usually have the true opcode name as all caps, and modifiers like this as lowercase. In this case we should maybe call it _gfx81 instead.

lib/Target/AMDGPU/SIISelLowering.cpp
549

Since you really want to get the i16 equivalent vector, you can use changeTypeToInteger on the f16 type.

589

I think you can assume the ABI type alignment for these (at least for the scalar type) not that it probably matters

3268

s/ChangeResultType/lowerIntrinsicWChain/

3274

switch over IID

3345

I'm pretty sure there should be no f32 or FP_ROUND here. This is returning the half format result just expanded into multiple 32-bit registers. They are still half, you need to truncate the int type and bitcast to f16.

4548–4549

Same thing as with stores

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.d16.ll
2

These should have a common GCN check prefix to avoid the duplicated -LABEL lines

6

Spaces after ;

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.d16.ll
35

The waitcnts aren't interesting to check. it would be more useful to extract and store the components to see what instructions are required.

cfang added inline comments.Nov 6 2017, 9:54 AM
lib/Target/AMDGPU/SIISelLowering.cpp
549

For v2f16, I want something that uses 2 registers, so v2i16 doesn't work for me.
What should be the type that specifies the two registers that hold the two half values? Here I am using v2f32.

3345

Do you mean the returned type is not v2f32? so what is the type that returned? v2f16? I just could not understand how to represent the two half values in two registers!

I hope you can be more specific on what to do here.

cfang marked 11 inline comments as done.Nov 7 2017, 11:36 AM
cfang added inline comments.
lib/Target/AMDGPU/AMDGPU.td
289

OK. Use HasUnpackedD16VMem.

lib/Target/AMDGPU/BUFInstructions.td
675

Done. Use _gfx80 because only gfx80 has the feature "HasUnpackedD16VMem".

lib/Target/AMDGPU/SIISelLowering.cpp
549

Right. So Remove this function definition.

589

Not sure how to get the ABI type alignment here since we are using nullptr as ptrval.

3268

Changed func name to lowerIntrinsicWChain.

3274

OK switch.

3345

For v2f16, changed the intrinsic load type to v2i32 if the target has unpacked vmem instructions. After the load,
we truncate the data to v2i16, and then bitcast it back to v2f16. v4f16 case is done similarly.

4548–4549

For store of v2f16, we first bitcast the data to v2i16, and then zero_extend it to v2i32 to store. v4f16 case is done similarly.

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.d16.ll
6

Thanks.

cfang updated this revision to Diff 121946.Nov 7 2017, 11:37 AM
cfang marked 7 inline comments as done.

Update based on Matt's comments.

b-sumner edited edge metadata.Nov 10 2017, 12:16 PM

Pardon my ignorance, but why isn't include/llvm/IR/IntrinsicsAMDGPU.td being updated?

Pardon my ignorance, but why isn't include/llvm/IR/IntrinsicsAMDGPU.td being updated?

The intrinsics use overloading based on the type, so no new intrinsics need to be defined.

lib/Target/AMDGPU/BUFInstructions.td
1464–1475

Are the pattens with SItbuffer_load and 16-bit types ever used? I believe the f16 maybe, but isn't the v2f16 getting replaced by ReplaceNodeResults?

I should add that apart from the one comment I had, the change looks good to me. It would be great if we could avoid those added custom SD nodes somehow, though...

cfang added a comment.Nov 22 2017, 1:30 PM
lib/Target/AMDGPU/BUFInstructions.td
1464–1475

ReplaceNodeResults will only replace illegal vector types (v2f16 on gfx8 and v4f16 on gfx8+).
So the pattens with SItbuffer_load and v2f16 are used for gfx9+ (and f16 are used for gfx8+).

cfang updated this revision to Diff 127023.Dec 14 2017, 2:09 PM
  1. Merge the patch with LLVM trunk.
  2. Update LIT tests to avoid specific registers.
cfang updated this revision to Diff 127152.Dec 15 2017, 9:33 AM

Merge with the latest LLVM trunk.

We need to keep the ball rolling! Please advice what else to do in order to move ahead. Thanks.

arsenm added inline comments.Dec 15 2017, 9:55 AM
lib/Target/AMDGPU/SIISelLowering.cpp
556–561

You don't need this, you can just do VT.getScalarType()

3333

I'm confused because all of the intrinsic logic isn't contained here, and this also just always uses the d16 version which is wrong

3412

The only use of this is the negated condition, so just don't negate it

4390–4393

This is an unrelated change

4711

This is redundant with the other type check, you really want isVector() and to just assert on v3f16

4720

Missing space before //

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.format.d16.ll
2

Use -enable-var-scope

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.load.d16.ll
2

-enable-var-scope

test/CodeGen/AMDGPU/llvm.amdgcn.tbuffer.store.d16.ll
2

-enable-var-scope

cfang marked 6 inline comments as done.Dec 15 2017, 12:17 PM
cfang added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
3333

SDValue Res = lowerIntrinsicWChain(Op, EquivLoadVT, DAG)

This is just a help function to replace the result type of "Op" with " EquivLoadVT". maybe we should use a different name to avoid confusion.

What about: ReplaceResultType ? Thanks.

4390–4393

Why do you think this is unrelated? I have to handle Intrinsic::amdgcn_buffer_load_format in getTgtMemIntrinsic.
After return "true" there, I think it is better to make appropriate change here also.

4711

Do you think we need to assert for v3f16? The code in this block has no reason not to handle v3f16. So the assert of v3f16, if any, should be somewhere else. Also MVY::v3f16 is not recognized in LC.

cfang updated this revision to Diff 127181.Dec 15 2017, 1:03 PM

Update based on Matt's recent comments. Thanks.

Can you add some assembler tests for this, particularly the 3x components ones

lib/Target/AMDGPU/BUFInstructions.td
925

You an use VReg_96

1873

You should be able to define the 3x encoded ones

1881

You should be able to define the 3x encoded ones

lib/Target/AMDGPU/SIISelLowering.cpp
581–601

This part should be superseded by D41470

3398

This should be in a separate function

4747

Dead code. You need to check the number of elements since there is no v3f16 yet

arsenm added inline comments.Dec 22 2017, 12:05 PM
lib/Target/AMDGPU/SIISelLowering.cpp
581–601

Also it's unrelated because this could be done separately for the existing intrinsics, as is done in D41470

3333

The point of this is to move all of the intrinsic lowering logic to one function. The type change wrapper may be a useful function used by the function called from the switch. Part of the point is to reduce the amount of code nested in switches

cfang marked 11 inline comments as done.Jan 4 2018, 3:12 PM
cfang updated this revision to Diff 128660.Jan 4 2018, 3:19 PM

Update based on Reviewer's comments:

  1. define encoding subtarget "GFX80" to encode unpacked d16 buffer/tbuffer instructions;
  2. add MC test cases;
  3. define two functions: SITargetLowering::lowerIntrinsicWChain and SITargetLowering::handleVDataToStore.
arsenm added inline comments.Jan 9 2018, 9:00 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3552–3553

This is still structured strangely. Most of the code inside this function should be in a helper, and lowerIntrinsicWChain should have the switch over the intrinsic

3565

This is not the correct chain, this is the input chain, not the output result chain

3572–3573

No return after else

3602

Remove = SDValue()

lib/Target/AMDGPU/SIISelLowering.h
63

Typo brirf, but autobrief is on so you can just remove it

64

Use reference for chain out argument, or just pass in the SmallVectorImpl& directly

lib/Target/AMDGPU/SIInstrInfo.td
1899

You should add a note explaining why this is here and that it should probably be removed at some point

cfang added inline comments.Jan 9 2018, 9:45 AM
lib/Target/AMDGPU/SIISelLowering.cpp
3565

I though "Res" is the output here, except that it has to be type-converted to the final result.

So how to get the "output result chain" you mentioned? Thanks.

cfang marked 4 inline comments as done.Jan 9 2018, 11:41 AM
cfang updated this revision to Diff 129134.Jan 9 2018, 11:43 AM

Update based on Matt's latest comments.

Please advice how to get the chain if you still don't agree what I got is actually the result chain. Thanks.

cfang updated this revision to Diff 129312.Jan 10 2018, 11:23 AM

Get the output "Chain" of the intrinsic correctly. Thanks.

arsenm added inline comments.Jan 10 2018, 2:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
3508–3509

We already have LowerINTRINSIC_W_CHAIN, so this name could be confusing

3564

Move this into a helper function called in each switch rather than putting code after the switch. Not every intrinsic with an illegal type will necessarily be this situation

3569

No return after else

cfang marked an inline comment as done.Jan 10 2018, 3:24 PM
cfang added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
3508–3509

This is actually the name you suggested before.

Do you think lowerIntrinsicWChainWithIllegalReturnType is OK? Or any ither name suggestion? Thanks.

3564

OK, will do as you suggested. Thanks.

3569

What do you mean here "no return after else"? Do you mean there should not be a return statement, or a return statement is missing?
Thanks.

cfang updated this revision to Diff 129366.Jan 10 2018, 4:09 PM
cfang marked an inline comment as done.

Rename a function and add a new helper function based on Matt's comments.

arsenm added inline comments.Jan 11 2018, 9:56 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4909

You don't need a pointer out argument here. You can just have handleVDataToStore return SDValue() if it doesn't need to do anything. If it actaully does something you know it is d16

cfang updated this revision to Diff 129490.Jan 11 2018, 11:24 AM

Redesign the function to handle Vdata to store so that is will be called only when vdata is of type d16.

lib/Target/AMDGPU/SIISelLowering.cpp
4909

I redesigned this function. It will called called only when VData is the type of "D16".

What do you think?

arsenm accepted this revision.Jan 12 2018, 11:35 AM

LGTM

This revision is now accepted and ready to land.Jan 12 2018, 11:35 AM
This revision was automatically updated to reflect the committed changes.