This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Support neg{.ftz} for f16 and f16x2
ClosedPublic

Authored by jchlanda on Oct 7 2022, 12:50 AM.

Details

Summary

f16/f16x2 neg was introduced in PTX 60, SM_53.

Diff Detail

Event Timeline

jchlanda created this revision.Oct 7 2022, 12:50 AM
jchlanda requested review of this revision.Oct 7 2022, 12:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2022, 12:50 AM
tra accepted this revision.Oct 7 2022, 10:30 AM

Just curious -- what prompts this change?

Does it buy us anything performance-wise? AFAICT llvm may be generating better code for gpus w/o fp16 support -- it does xor on 32-bit value w/o splitting it into 16-bit halfs. https://godbolt.org/z/Wjx7ceT75
Or is it needed to flush fp16 denormals consistently?

This revision is now accepted and ready to land.Oct 7 2022, 10:30 AM

Just curious -- what prompts this change?

Does it buy us anything performance-wise? AFAICT llvm may be generating better code for gpus w/o fp16 support -- it does xor on 32-bit value w/o splitting it into 16-bit halfs. https://godbolt.org/z/Wjx7ceT75
Or is it needed to flush fp16 denormals consistently?

In all honesty I don't know what the motivation for this was, it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.
FWIW, using negdirectly does not require a bitcast from Float16x2Regs to Int32Regs or Float16Regs to Int16Regs, as seen in the xor case.

        // .globl       test_neg_f16
.visible .func  (.param .b32 func_retval0) test_neg_f16(
        .param .b32 test_neg_f16_param_0
)
{
        .reg .b16       %h<3>;

        ld.param.b16    %h1, [test_neg_f16_param_0];
        neg.f16         %h2, %h1;
        st.param.b16    [func_retval0+0], %h2;
        ret;

}
        // .globl       test_neg_f16x2
.visible .func  (.param .align 4 .b8 func_retval0[4]) test_neg_f16x2(
        .param .align 4 .b8 test_neg_f16x2_param_0[4]
)
{
        .reg .b32       %hh<3>;

        ld.param.b32    %hh1, [test_neg_f16x2_param_0];
        neg.f16x2       %hh2, %hh1;
        st.param.b32    [func_retval0+0], %hh2;
        ret;

}
tra added a comment.Oct 11 2022, 10:59 AM

In all honesty I don't know what the motivation for this was, it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.

FWIW, using negdirectly does not require a bitcast from Float16x2Regs to Int32Regs or Float16Regs to Int16Regs, as seen in the xor case.

Such bitcasts are essentially no-ops once ptxas is done with them. PTX ends up being a bit more verbose, but it usually has no impact on the SASS. FP and integers are kept in the same registers on the actual hardware. I've commented on the original bug.

Anyways, I think this change is fine. I just wanted to make sure I'm not missing something.

In all honesty I don't know what the motivation for this was, it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.

FWIW, using negdirectly does not require a bitcast from Float16x2Regs to Int32Regs or Float16Regs to Int16Regs, as seen in the xor case.

Such bitcasts are essentially no-ops once ptxas is done with them. PTX ends up being a bit more verbose, but it usually has no impact on the SASS. FP and integers are kept in the same registers on the actual hardware. I've commented on the original bug.

Anyways, I think this change is fine. I just wanted to make sure I'm not missing something.

Thank you for explaining and commenting on the github issue, I had a feeling that those extra moves would be swizzled into nothing when generating sass.
Would you be so kind and land this patch for me?

This revision was automatically updated to reflect the committed changes.