This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add intrinsics for 16 bit interpolation
ClosedPublic

Authored by timcorringham on May 11 2018, 6:48 AM.

Details

Summary

Added the intrinsics llvm.amdgcn.interp.p1.f16() and
llvm.amdgcn.interp.p2.f16() and related LIT test.

The p1 intrinsic generates code appropriate for both 16 and 32
bank LDS.

Diff Detail

Event Timeline

timcorringham created this revision.May 11 2018, 6:48 AM
timcorringham added reviewers: Restricted Project, dstuttard, arsenm, tpr.May 11 2018, 6:51 AM
arsenm requested changes to this revision.May 14 2018, 4:26 AM
arsenm added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
970

You should add name mangling to the existing intrinsics rather than new intrinsics. The builtin declaration needs to be done in clang for the GCCBuiltin

This revision now requires changes to proceed.May 14 2018, 4:26 AM
timcorringham added inline comments.May 15 2018, 4:56 AM
include/llvm/IR/IntrinsicsAMDGPU.td
970

I now have the clang changes in D46871 (I have added the 32 bit interp builtins too as they were missing).

I don't believe it is possible to overload these intrinsics as they have an extra operand compared to the 32 bit versions. Also apart from the extra operand the signature of the 16 bit p1 intrinsic is identical to the 32 bit one, so there iosn't any type difference to overload.

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that cware a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that cware a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

Is the extra parameter you're referring the high parameter to change where the register is read from the high or low bits? That shouldn't be exposed in the intrinsic at all. Eliminating the high bit extraction is a codegen optimization pattern

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that cware a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

Is the extra parameter you're referring the high parameter to change where the register is read from the high or low bits? That shouldn't be exposed in the intrinsic at all. Eliminating the high bit extraction is a codegen optimization pattern

Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDs loads, but also op_sel control of destination bits

arsenm added inline comments.May 18 2018, 10:55 AM
lib/Target/AMDGPU/AMDGPUSearchableTables.td
52–53

Should get a test in test/DivergenceAnalysis

Even without the high operand I don't think it is possible to overload interp_p1 and interp_p1_f16 as they would have identical types - there is nothing to disambiguate them.

Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDs loads, but also op_sel control of destination bits

Yes, the high bit controls the LDS access. As all the operands to interp_p1_f16 are the same types as for the 32 bit variant, I don't know of any way to deduce the value of the high bit if it isn't specified explicitly.

Added a divergence LIT test for the 16 bit interp intrinsics.

tpr added inline comments.May 22 2018, 11:34 AM
lib/Target/AMDGPU/VOP3Instructions.td
459

Don't forget to fix the problem found with this i1 in testing.

Change the omod operand type to be i32 rather than i1, to avoid
a build failure when building using a debug TableGen.

[AMDGPU] Add intrinsics for 16 bit interpolation

Added a new pass to to ensure that the 16 bit interpolation
instructions use the round to zero rounding mode.

A slighly more performant implementation of the pass to add any
required changes to the double precision rounding mode.

Refactored pass to insert rounding mode to use a style more in line
with other LLVM passes. This fails to optimize a few corner cases,
but they are expected to occur very rarely if at all.

Changed mode register pass to use an explicit stack instead of recursion.

Removed the mode register pass, as that will be introduced as a
separate change.

arsenm added inline comments.Jul 27 2018, 1:53 AM
test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
2–4

Use -'s instead of _'s in the check prefixes

6–8

Might as well just use update_llc_test_checks at this point?

Updated the LIT test as per review comments.

timcorringham marked 2 inline comments as done.

Rebased, and amended LIT test now that the required mode register
pass has been committed.

arsenm added inline comments.Jan 22 2019, 2:14 PM
test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
58

Can you add a test case with LDS usage to make sure m0 is properly restored after?

arsenm added inline comments.Jan 22 2019, 2:54 PM
lib/Target/AMDGPU/AMDGPUSearchableTables.td
52–53

Test still missing

timcorringham marked an inline comment as done.

Extended llvm.amdgcn.interp.f16.ll to check that m0 is set before
each interp instruction if necessary, and added a new LIT test
to check that the interp f16 intrinsics are identified as being
divergent.

timcorringham marked 2 inline comments as done.Jan 24 2019, 9:33 AM
timcorringham added inline comments.
test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
58

I have added test cases to check that m0 is set up before each of the interp f16 instructions if necessary. I have done this by explicitly writing to m0 rather than using LDS as I couldn't see a way to do the latter, and other tests use the technique of writing to m0.

arsenm accepted this revision.Jan 25 2019, 9:13 AM

LGTM

This revision is now accepted and ready to land.Jan 25 2019, 9:13 AM
This revision was automatically updated to reflect the committed changes.