This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add support for TFE/LWE in image intrinsics
ClosedPublic

Authored by dstuttard on Jul 2 2018, 4:32 AM.

Details

Summary

TFE and LWE support requires extra result registers that are written in the
event of a failure in order to detect that failure case.
The specific use-case that initiated these changes is sparse texture support.

This means that if image intrinsics are used with either option turned on, the
programmer must ensure that the return type can contain all of the expected
results. This can result in redundant registers since the vector size must be a
power-of-2.

This change takes roughly 5 parts:

  1. Modify the instruction defs in tablegen to add new instruction variants that

can accomodate the extra return values.

  1. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE

(where the bulk of the work for these instruction types is now done)

  1. Extra verification code to catch cases where intrinsics have been used but

insufficient return registers are used.

  1. Modification to the adjustWritemask optimisation to account for TFE/LWE being

enabled (requires extra registers to be maintained for error return value).

  1. An extra pass to zero initialize the error value return - this is because if

the error does not occur, the register is not written and thus must be zeroed
before use. Also added a new (on by default) option to ensure ALL return values
are zero-initialized that is required for sparse texture support.

Tests have been added/modified to test the new behaviour.

Diff Detail

Repository
rL LLVM

Event Timeline

dstuttard created this revision.Jul 2 2018, 4:32 AM

@nhaehnle - just added you as reviewer at the moment.

I wasn't sure if the separate pass to do the mimg result initialization would be better done in the lowerImage in SIISelLowering.cpp, but thought I'd submit what I've got to canvas opinion.

tpr added inline comments.Jul 2 2018, 6:30 AM
lib/Target/AMDGPU/AMDGPU.h
146 ↗(On Diff #153696)

This accidental double semicolon gave me lots of warnings.

arsenm added a comment.Jul 2 2018, 7:10 AM

I would expect the intrinsics to change for this. You can use a struct return type, which is what I would expect for this. Something like { <4 x float>, i1 }? You also could have a 5 element vector, it would just require more work to deal with during lowering

arsenm added inline comments.Jul 2 2018, 7:13 AM
lib/Target/AMDGPU/SIAddIMGInit.cpp
16 ↗(On Diff #153696)

This needs to go below includes

27 ↗(On Diff #153696)

Why do you need to include this?

46 ↗(On Diff #153696)

You don't need this, the default comes from the INITIALIZE_PASS

83 ↗(On Diff #153696)

MI.mayStore() works

84–86 ↗(On Diff #153696)

Capitalize

112 ↗(On Diff #153696)

New line

tpr added inline comments.Jul 4 2018, 12:58 PM
lib/Target/AMDGPU/AMDGPU.td
380 ↗(On Diff #153696)

Surely this feature should be called enable-prt-strict-null?

nhaehnle added inline comments.Jul 27 2018, 7:13 AM
lib/Target/AMDGPU/SIISelLowering.cpp
7799 ↗(On Diff #153696)

This should be a bool.

lib/Target/AMDGPU/SIInstrInfo.cpp
2780 ↗(On Diff #153696)

Can also use !MI.mayStore()

2793–2798 ↗(On Diff #153696)

It's a minor thing, but I think it would be easier to follow to first divide the RegCount for D16 based on (D16 && D16->getImm() && !ST.hasUnpackedD16VMem()), and then increment that for LWE || TFE afterwards.

test/CodeGen/AMDGPU/llvm.amdgcn.image.dim.ll
15–19 ↗(On Diff #153696)

This doesn't test everything we'd want to test here, which is that in the NOPRT case, you don't have initializations of v0-v3.

I'd suggest adding STRICTPRT and NONSTRICTPRT check prefixes, and adding GCN to the new run line as well, so you don't need to duplicate the image_load line, and you can use NONSTRICTPRT-NOT lines.

test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.dim.ll
21 ↗(On Diff #153696)

How about checks for initialization here?

tpr added a comment.Aug 29 2018, 6:32 AM

Please also fold in the fix to disable InstCombine for image ops with tfe/lfe (AMDVLK fixes c5675e2d, fea7c135, 7b7276ec).

lib/Target/AMDGPU/SIISelLowering.cpp
7798 ↗(On Diff #153696)

Please use operand names (per AMDVLK fix 458248e9).

Folded in most of the changes highlighted in the review

There are also several new changes to the original patch in light of bugs
uncovered during further testing with applications using the support.

I haven't attempted to implement Matt's suggestion of using a struct return type
or 5-vec support. Could we add that as something to do at a later date?
Nicolai, do you have any feedback on that piece of work - my feeling is that it
is quite a complicated change based on how the intrinsics are defined. If not, I
can take another look if you can give some pointers.

dstuttard marked 14 inline comments as done.Sep 11 2018, 10:35 AM
dstuttard added inline comments.
lib/Target/AMDGPU/SIInstrInfo.cpp
2793–2798 ↗(On Diff #153696)

Agreed - not really sure how I arrived at the original non-obvious way. Suspect it evolved.

dstuttard marked 2 inline comments as done.Sep 11 2018, 10:36 AM

Thanks! Some small issues let, and please also add a test case for the "simplifyDemanded" implementation.

As for the question of return types: on second thought,having <8 x half> as a return type when halves 5 & 6 are really combined to a single i32 LWE/TFE return is indeed a bit awkward. It would make sense to have {<4 x half>, i32} instead at the LLVM IR level. The intrinsics definition themselves should relax fairly easily. You need to replace llvm_anyfloat_ty by llvm_any_ty in:

  • AMDGPUDimSampleProfile
  • defms for int_amdgcn_image_load{,_mip}

The machine instructions themselves would stay the same. The only risk is the SelectionDAG itself, though I don't see anything that would fail or be particularly complicated to deal with right now. I believe the struct type just gets split into two separate result values of the SDNode.

lib/Target/AMDGPU/SIAddIMGInit.cpp
103 ↗(On Diff #164921)

Variable names should be capitalized, here and below.

170 ↗(On Diff #164921)

Space between MF and MI.

lib/Target/AMDGPU/SIISelLowering.cpp
4643 ↗(On Diff #164921)

The LoadVT.isVector() is redundant.

lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
935–941 ↗(On Diff #164921)

TFE and LWE enables are combined in a single argument, so this needs to be fixed.

test/CodeGen/AMDGPU/llvm.amdgcn.image.dim.ll
620 ↗(On Diff #164921)

Oops :)

dstuttard updated this revision to Diff 170823.Oct 24 2018, 1:00 AM

Changed the implementation of the intrinsic return type to be an aggregate type

dstuttard marked 5 inline comments as done.Oct 24 2018, 1:03 AM

Covered all the requested changes (I think). Also implemented a test to make sure that simplifyDemanded doesn't run when TFE/LWE is enabled.

Thank you for making these changes. I have some detail comments inline and some high-level remarks:

  • We need tests for {float, i32}, {<2 x float>, i32}, and {half, i32} return types
  • We need tests with dmask = 0 from the very beginning in the intrinsic (there's a case in lowerImage which looks like it may be broken), and a test with dmask != 0 in the intrinsic, but none of the data returns are being used (only the i32 return is used).
  • We need tests where the dmask is materially smaller than the return type, e.g. return type {<4 x float>, i32} and popcount(dmask) < 4, and return type {<2 x half>, i32} and popcount(dmask) <= 2.

I also have a high-level concern about the design of lowerImage, because the adjustRetValue feels a bit like spooky action at a distance, plus concern about the last point above about dmask. Have you considered the following possibility:

  • Keep ReturnTypes[0] as-is until the code that determines NumVDataDwords.
  • In the load case, adjust NumVDataDwords first based on popcount(dmask) and then based on whether LWE/TFE is enabled.
  • Synthesize a new ReturnTypes[0] from scratch based on NumVDataDwords
  • Then generically extract the data payloads vector from the NewNode, doing cast and adjust-for-unpackedD16 in a common path for both with and without TFE/LWE.
lib/Target/AMDGPU/SIAddIMGInit.cpp
7 ↗(On Diff #170823)

The more common practice seems to be having a separator line here between the license info and the file description.

20–21 ↗(On Diff #170823)

I believe you don't need these two includes.

86–87 ↗(On Diff #170823)

I'd prefer this to be an assertion. It's easy enough to change if we ever do get MIMG instructions without TFE/LWE. In the meantime, assertions allow us to be more explicit about the space of possibilities we actually need to think about.

108 ↗(On Diff #170823)

Same argument here: I'd prefer this to be an assertion.

117 ↗(On Diff #170823)

It seems prudent to add a static assertion that AMDGPU::sub0 == 1 && AMDGPU::sub4 == 5 here.

lib/Target/AMDGPU/SIISelLowering.cpp
4722 ↗(On Diff #170823)

This can be moved lower.

lib/Transforms/InstCombine/InstCombineInternal.h
803 ↗(On Diff #170823)

Should be TFCIdx for consistency.

dstuttard updated this revision to Diff 171699.Oct 30 2018, 7:51 AM

Made minor code changes suggested in review

dstuttard marked 7 inline comments as done.Oct 30 2018, 7:57 AM

Minor changes made.

I'll look at implementing the extra tests you highlight - I think at least one of those might already be covered, but I'll verify that during the implementation of the new ones.
I think we might also have an issue with non-TFE/LWE variants that have a return type that is larger than the dmask (the last of your bullets) - so fixing this for the TFE/LWE case might also fix that - I'll check.

Regarding the suggested re-factor to remove the spooky effect at a distance code :) - agreed that it is a bit of a muddle as it stands. I think your suggestion is a good one so I'll look at re-factoring as suggested.

lib/Target/AMDGPU/SIAddIMGInit.cpp
86–87 ↗(On Diff #170823)

Agreed, that makes sense

Thanks for making the changes.

I think we got away with return type larger than dmask previously, because in the worst case we just over-conservatively allocate registers. But now that the number of registers actually matters because it's used to access the TFE/LWE return...

dstuttard updated this revision to Diff 172942.Nov 7 2018, 6:21 AM
dstuttard marked an inline comment as done.

Modified based on review feedback from Nicolai

Not sure that the new code is any less spooky, but I'm sure it is more robust.
New requested tests also implemented.
Rebased onto latest trunk and fixed new issues with a16 support.

Added Neil as a reviewer as I've made some changes to some of his a16 tests. I'm pretty certain that the modifications are correct, but wanted to get feedback on that as well.

I think that we can get rid of adjustWriteMask altogether at some point, but it will require extra support in the instcombine for the image instructions to support TFE/LWE (it just gives up at the moment), but it is definitely wrong to have this code in 2 places. However, we can do that as a separate change - this one is getting quite large already!

Thank you, this looks much cleaner. I only have a small number of nitpicks left.

lib/Target/AMDGPU/SIISelLowering.cpp
885 ↗(On Diff #172942)

Space after the comma.

4668 ↗(On Diff #172942)

ResultTypes can be const, right? Use ArrayRef if yes, SmallVectorImpl otherwise.

4711–4713 ↗(On Diff #172942)

I think you can directly ExtractVectorElements into BVElts and get rid of Elts entirely.

4980–4984 ↗(On Diff #172942)

Please put braces around the multi-line if "block" here.

dstuttard updated this revision to Diff 174613.Nov 19 2018, 7:35 AM
dstuttard marked 4 inline comments as done.

Thanks for the review - made all the suggested changes

lib/Target/AMDGPU/SIISelLowering.cpp
4711–4713 ↗(On Diff #172942)

Doh.

This revision is now accepted and ready to land.Nov 29 2018, 4:04 AM
This revision was automatically updated to reflect the committed changes.