Page MenuHomePhabricator

[X86][FP16] Only generate approximate rsqrt when Reciprocal is true for half type
AbandonedPublic

Authored by pengfei on Nov 29 2021, 6:37 PM.

Details

Summary

We have reasonable fast sqrt and accurate rsqrt for half type due to the
limited fractions. So neither do we need multi steps refinement for
rsqrt nor replace sqrt by rsqrt.
This fixes a correctness issue when RefinementSteps = 0.

Diff Detail

Event Timeline

pengfei created this revision.Nov 29 2021, 6:37 PM
pengfei requested review of this revision.Nov 29 2021, 6:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 29 2021, 6:37 PM

What's the correctness issue? It's fast math so the answer isn't required to be exact.

@kmclaughlin I'm meant to change in the target independent code (enabled in D110557), then I found the test in your D111657 was affected too. I'll change there if you think it is workable for aarch64 too.

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correntness when Reciprocal == 0.

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correntness when Reciprocal == 0.

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I think for f16 you should also be returning true from X86TargetLowering::isFsqrtCheap which would prevent getSqrtEstimate from being called for the non-reciprocal case.

pengfei updated this revision to Diff 390562.Nov 29 2021, 9:21 PM

Return false for f16 in X86TargetLowering::isFsqrtCheap.

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correntness when Reciprocal == 0.

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I see. But I'm not sure if it is bug or by design. Maybe need an assertion?

I think for f16 you should also be returning true from X86TargetLowering::isFsqrtCheap which would prevent getSqrtEstimate from being called for the non-reciprocal case.

Good point!

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correntness when Reciprocal == 0.

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I see. But I'm not sure if it is bug or by design. Maybe need an assertion?

I think this is a weird interface quirk. The caller interprets returning RefinementSteps = 0 to mean that all needed code has been created and nothing should be done. Theoretically a target could have its own way of handling it without the final FMUL the target independent code inserts. So I think X86 should either insert the final FMUL itself or not do the reciprocal approximation for non-reciprocal if RefinementSteps is 0. NVPTXTargetLowering::getSqrtEstimate does the latter.

pengfei updated this revision to Diff 390577.Nov 29 2021, 11:10 PM

So I think X86 should either insert the final FMUL itself or ...

Insert a FMUL. Thanks Craig!

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correntness when Reciprocal == 0.

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I see. But I'm not sure if it is bug or by design. Maybe need an assertion?

I think this is a weird interface quirk. The caller interprets returning RefinementSteps = 0 to mean that all needed code has been created and nothing should be done. Theoretically a target could have its own way of handling it without the final FMUL the target independent code inserts. So I think X86 should either insert the final FMUL itself or not do the reciprocal approximation for non-reciprocal if RefinementSteps is 0. NVPTXTargetLowering::getSqrtEstimate does the latter.

That looks like a long-standing bug for x86. I don't remember if that was a design flaw originally or if the code evolved to handle more patterns, and we just missed that problem.
Better to add the tests first, add the code to fix the bug, then add the code to bypass for f16?

llvm/test/CodeGen/X86/avx512fp16vl-intrinsics.ll
972

Does it make sense to add a scalar test too?

define half @sqrt_half_fast(half %a0, half %a1) {
  %1 = call fast half @llvm.sqrt.f16(half %a0)
  ret half %1
}
pengfei abandoned this revision.Nov 30 2021, 10:05 PM
pengfei marked an inline comment as done.

Break it into 3 patches: rG65a3de91ab3e, D114843 and D114844.