This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix issue for zext of f16 to i32
Needs ReviewPublic

Authored by dstuttard on Sep 11 2018, 4:47 AM.

Details

Reviewers
tpr
arsenm
Summary

Vulkan exposed an issue with this for a case with v_mad_mixlo_f16 where the
upper 16 bits were not cleared.

Modifying this to clear the bits instead of just copying fixed the problem.

V2: Fixed up "Fix issue for zext of f16 to i32"
V3: Fixed fcanonicalize-elimination test

Diff Detail

Event Timeline

dstuttard created this revision.Sep 11 2018, 4:47 AM
arsenm added inline comments.Sep 11 2018, 5:26 AM
lib/Target/AMDGPU/SIInstructions.td
1353

IIRC this node is only supposed to be emitted if the high bits are known zero, so something is wrong upstream if it’s gotten here

dstuttard added inline comments.Sep 11 2018, 6:13 AM
lib/Target/AMDGPU/SIInstructions.td
1353

Ok - thanks. I'll take another look.

Looking again at the code - you're correct that it attempts to only do this transformation if the high bits are zero.
However, the code that checks this has the following telling comment:

// (i32 zext (i16 (bitcast f16:$src))) -> fp16_zext $src
// FIXME: It is not universally true that the high bits are zeroed on gfx9.
if (Src.getOpcode() == ISD::BITCAST) {
  SDValue BCSrc = Src.getOperand(0);
  if (BCSrc.getValueType() == MVT::f16 &&
      fp16SrcZerosHighBits(BCSrc.getOpcode()))
    return DCI.DAG.getNode(AMDGPUISD::FP16_ZEXT, SDLoc(N), VT, BCSrc);
}

In this particular case the BCSrc operation was an fptrunc which passes the fp16SrcZerosHighBits test - but that eventually ends up as v_mad_mixlo_f16 which doesn't ensure that the high bits are zero.

Any suggestions on how to proceed? I agree that it seems a shame to have to insert the extra AND operation blindly.

Looking again at the code - you're correct that it attempts to only do this transformation if the high bits are zero.
However, the code that checks this has the following telling comment:

// (i32 zext (i16 (bitcast f16:$src))) -> fp16_zext $src
// FIXME: It is not universally true that the high bits are zeroed on gfx9.
if (Src.getOpcode() == ISD::BITCAST) {
  SDValue BCSrc = Src.getOperand(0);
  if (BCSrc.getValueType() == MVT::f16 &&
      fp16SrcZerosHighBits(BCSrc.getOpcode()))
    return DCI.DAG.getNode(AMDGPUISD::FP16_ZEXT, SDLoc(N), VT, BCSrc);
}

In this particular case the BCSrc operation was an fptrunc which passes the fp16SrcZerosHighBits test - but that eventually ends up as v_mad_mixlo_f16 which doesn't ensure that the high bits are zero.

Any suggestions on how to proceed? I agree that it seems a shame to have to insert the extra AND operation blindly.

I guess you could check the subtarget in fp16SrcZerosHighBits. However that's pretty risky since it's depending on things we can't guarantee. Something could transform any other instruction into something else that won't preserve this. Overall I'm very unhappy this hardware change happened and it's a lot of work to handle all of this properly. I think what we really need is to drop this combine/node, and a separate machine instruction for every operation that preserves the high bits (with a tied source operand) vs. zeros them, and then have a machine pass that tries to clean up the extra ands while dropping this combine. We'll have to do extra work because we will have missed out on combines that this was enabling.

Looking again at the code - you're correct that it attempts to only do this transformation if the high bits are zero.
However, the code that checks this has the following telling comment:

// (i32 zext (i16 (bitcast f16:$src))) -> fp16_zext $src
// FIXME: It is not universally true that the high bits are zeroed on gfx9.
if (Src.getOpcode() == ISD::BITCAST) {
  SDValue BCSrc = Src.getOperand(0);
  if (BCSrc.getValueType() == MVT::f16 &&
      fp16SrcZerosHighBits(BCSrc.getOpcode()))
    return DCI.DAG.getNode(AMDGPUISD::FP16_ZEXT, SDLoc(N), VT, BCSrc);
}

In this particular case the BCSrc operation was an fptrunc which passes the fp16SrcZerosHighBits test - but that eventually ends up as v_mad_mixlo_f16 which doesn't ensure that the high bits are zero.

Any suggestions on how to proceed? I agree that it seems a shame to have to insert the extra AND operation blindly.

I guess you could check the subtarget in fp16SrcZerosHighBits. However that's pretty risky since it's depending on things we can't guarantee. Something could transform any other instruction into something else that won't preserve this. Overall I'm very unhappy this hardware change happened and it's a lot of work to handle all of this properly. I think what we really need is to drop this combine/node, and a separate machine instruction for every operation that preserves the high bits (with a tied source operand) vs. zeros them, and then have a machine pass that tries to clean up the extra ands while dropping this combine. We'll have to do extra work because we will have missed out on combines that this was enabling.

OK - given that something like that is a larger change, how about we commit this (with an appropriate comment) for now and work on something better in the long term?

@arsenm Matt, any more comments? Would you be happy with a clarification comment as per the last suggestion from me?

ping

What happens if you just drop the optimization entirely?

ping

What happens if you just drop the optimization entirely?

Not sure what you mean - I get a load of lit test failures (31), but that's what I'd expect since it no longer does the transform to use FP16_ZEXT - unless you mean something else?

Herald added a project: Restricted Project. · View Herald TranscriptAug 26 2021, 6:04 PM
Herald added a subscriber: kerbowa. · View Herald Transcript
arsenm resigned from this revision.Jan 19 2022, 6:53 AM

This can be abandoned