Depending on the environment, a floating point instruction should
treat denormal inputs as zero, and/or flush a denormal output to zero.
Denormals are not currently accounted for when an instruction gets
folded to a constant, which can lead to differences in output between
a folded and a unfolded instruction when running on the target. The
denormal handling mode can be set by the function level attribute
denormal-fp-math, which this patch uses to determine whether any
denormal inputs to or outputs from folding should be zero, and that
the sign is set appropriately.
Details
Diff Detail
Event Timeline
Does "original operation" mean what hardware would do? Not all targets support flushing denormals to zero. Even within X86, SSE can flush denormals to zero, but X87 can't.
Thanks for highlighting that. Producing the appropriate result for the hardware was what I meant, so based on that I would have to rework this to handle different targets.
It doesn't look like constant folding currently has such target knowledge, and the only obvious solution I can see would be to use TargetTransformInfo to determine if the target should flush denormals from folding. However this would also mean passing TTI around much more in order to pass it down through to the constant folding functions wherever they get used.
I've updated the patch to use TargetTransformInfo to determine whether the target supports flushing to zero. I'm slightly wary, as there is an warning discouraging using TargetTransformInfo, but I unsure of an alternative.
llvm/lib/Analysis/ConstantFolding.cpp | ||
---|---|---|
1006 | You should probably also explain that the return nullptr means if you don't have a TTI, you are explicitly preventing any constant folding of operations that produce denormals, rather than continuing to fold to the denormal value. |
Ping for any other comments.
I've also added the suggested comment in the code. And yes, one result of this patch would indeed be that calls to constant folding from passes other than instruction combining would not be able to fold floating point instructions that result in a denormal - since they lack TTI. While it would be possible to add TTI to those passes, that becomes a significantly larger change and I was unsure how necessary that would be if instruction combining can already handle this case.
Definitely not a fan of this change, on multiple levels. I don't particularly look forward to having a TTI dependency in constant folding, but more immediately, I don't think I really understand the LangRef basis for this change.
Could you explain which constituent part of fast FMF specifically allows this optimization? I don't see anything related to denormals, and interpreting afn to apply to primitive operations would be a bit of a stretch.
How does this relate to the denormal-fp-math function attribute? I would have thought that this is the one that controls denormal flushing behavior.
Sorry, I didn't look at this review sooner, but I agree with @nikic. The TTI warning comment in instcombine was supposed to prevent this kind of proposal, and this seems to be mixing up seemingly unrelated pieces of FP behavior. It might help to see a source/motivating example to understand if there's any realistic solution to the problem.
Thanks for taking a look, it does appear I misunderstood a few things.
The original motivation was that downstream we found a case where the output of the compiled program is different between O0 and O1: at O0 a floating point instruction executes on the target and produces a zero, but at O1 the instruction gets constant folded to a denormal value. So I have been exploring whether it would be possible to handle denormals at the time of folding, since this occurs before other combination passes (e.g DAGCombiner).
The denormal-fp-math attribute does contains the relevant information about the floating point envionment, and should already be accessible during folding via the instruction pointer (assuming it belongs to a function at the time). So I believe I could potentially rework this to avoid pulling in target info by checking denormal-fp-math instead, if that would be more acceptable. Reading the language reference, the attribute doesn't mandate flushing denormals to zero, but does suggest inputs should be treated as zero, which constant folding also does not currently respect. On testing, this can lead to similar differences in ouput when folding a floating point instruction where one input is a denormal, so it may make sense to check the inputs as well as the output in ConstantFoldInstOperands.
If we can use the function attribute to get the desired result, I think that would be fine.
In an ideal world, we would have all of the FP settings in one place, but FMF became part of the bonus bits in an instruction, and there's not enough space there to represent variations like denorm or sqrt specializations.
If the attribute is not specified as needed, then we should clarify/enhance that in LangRef.
I've updated the patch with a new version which now takes the denormal handling mode from the function attribute, and adjusted the title/summary to reflect this. This supports both different settings for inputs and outputs to the instruction, as well as whether values get flushed positive zero or the sign is preserved.
While testing I found the denormal-fp-math-f32 attribute was being set unexpectedly, but I've created a separate patch to deal with that: https://reviews.llvm.org/D122589. It should not be an issue for this patch however, since it simply uses the attributes as given.
ConstantFoldBinaryOpOperands has other callers; are you planning to fix each of them separately?
Assuming IEEE denormal handling if we can't find the parent Function seems a bit dubious, but maybe it's the best we can do for now. It's not clear to me how this interacts with floating-point constant expressions (e.g. ConstantExpr::getFAdd). I guess we could just kill off floating-point constant expressions, since they aren't really useful in practice, but that's a non-trivial effort.
When I originally checked the other calls to ConstantFoldBinaryOpOperands did not look like they would potentially be handling floating point instructions, although on second look, I missed InstructionSimplify::foldOrCommuteConstant. The same approach should work there too though, so I can expand the patch to cover that usage.
If the instruction lacks a parent function, then the alternative to defaulting to IEEE would be not folding at all. The impact to that could be limited to only instructions to where a denormal is detected in the input/output; if there's no denormal, there's no need for a parent function so folding can proceed.
Constant expressions won't be currently be affected by this, although potentially they could also follow the principle of only getting folded if denormals are not involved.
Maybe... refusing to fold only when we see a denormal might lead to bugs which only show up for specific constants, though. I think the current approach of just continuing to do IEEE folding is fine as an incremental step.
Constant expressions won't be currently be affected by this, although potentially they could also follow the principle of only getting folded if denormals are not involved.
My concern with constant expressions here is mostly optimizations turning instructions into constant expressions, e.g. TargetFolder/ConstantFolder. If frontends create constant expressions, it's less important what happens.
llvm/lib/Analysis/ConstantFolding.cpp | ||
---|---|---|
1004 | fneg is not a computational FP operation; it's a signbit operation. For example on x86 with SSE, it's implemented with a vector integer xor instruction, so it is not affected by denorm FP mode. I'm not sure what happens on targets that have a real fneg instruction. Either way, we need at least one test to check the behavior. |
I've removed FNeg from the changes; indeed it wasn't affected by denormal mode wherever I tried. That did allow me to refactor slightly to wrap around just ConstantFoldBinaryOpOperands, and fall back to that when dealing with a constant expression or functionless instruction just as before. So for the moment the denormal mode information is only applied in situations where it is available, which is one step forward at least.
Please add a testcase for opt -instsimplify.
llvm/lib/Analysis/InstructionSimplify.cpp | ||
---|---|---|
612 ↗ | (On Diff #421535) | Maybe we should just return early if CxtI is null, instead of falling back to ConstantFoldBinaryOpOperands? |
Right - I don't think we want any -instcombine tests with this patch. It should be completely testable from -instsimplify. And we should vary the opcodes (not just fmul), so we have at least partial coverage for each one (plus a negative test for "fneg").
I've moved the tests to instsimplify, and expanded them out to cover more cases and additional instructions. While there may be some overlap, this structures it a bit better and ensures cases don't get conflated: depending on the instruction some zero results can be obtained from either the input getting zeroed or the output getting zeroed, so it's better to test both separately.
llvm/lib/Analysis/InstructionSimplify.cpp | ||
---|---|---|
612 ↗ | (On Diff #421535) | Returning early here would change the result for existing cases where there is no instruction pointer, so constant expressions won't get folded where they would before. |
llvm/test/Transforms/InstCombine/AArch64/constant-fold-fp-denormal.ll | ||
4 ↗ | (On Diff #421535) | I don't think it's necessary to test those if they aren't going to be affected by setting the function attribute. |
I don't understand why we have (duplicate?) tests with -instcombine. Also, why are there ARM and AArch test file variants? IIUC, the target makes no difference - the behavior is completely specified by the function attributes.
I like that we're testing each opcode with each attribute combination for thoroughness, but I'd prefer to have that all in one file rather than split by opcode. Wouldn't it be easier to see the progression if the tests were ordered based on the attributes rather than opcode? Ie, if we're testing that an input is flushed to zero, then that denorm constant could be repeated N times in a row independently of the opcode.
Once we have the right set of tests in place, you can pre-commit them with baseline CHECK lines, and then it will be easier to see how this patch changes functionality (and we can add test comments to explain more if needed).
Sorry, the instcombine tests are from the previous version and shouldn't have been included in the diff.
Originally I put the tests in ARM/AArch64 because those seemed the relevant ones where you'd expect to see the attributes with all the different modes, but you're right; having the tests in both is redundant, and no target is needed at all really when the test is specifying the attributes. So I will combine the opcodes into one file, and can move them down a folder.
On ordering, while one set of inputs would work for multiple attributes/opcodes when testing the inputs are correctly flushed, testing that the output is flushed requires specific inputs for each opcode/attribute combination to produce a subnormal output. Where possible, I tried to pick input values relevant to the opcode where one set of inputs produced different results based on the attribute and grouped based on that, since then the effect of the attribute is visible at a glance. For example with fadd, the same inputs and opcode can produce four different results depending on which attribute is used, and the result of the input getting flushed is distinct from the result when the output is flushed. Keeping those tests together felt more readable than continually changing the inputs to order by attribute first.
define zeroext i1 @foo() #0 { %_add = fadd fast double 1.264810e-321, 3.789480e-321 %_res = fcmp fast une double %_add, 5.054290e-321 ret i1 %_res } attributes #0 = { "denormal-fp-math"="positive-zero" }
llc 1.ll -mtriple=powerpc64le-unknown-linux-gnu
Hi, this patch causes mis-compile for above case, now %_res is true while before this patch it is false. We can not handle the denormal constantFP in fcmp? Will the denormal constantFP be in other opcodes as well?
Thanks.
Alive says that LLVM is correct:
define i1 @foo() denormal-fp-math=positive-zero,positive-zero { %_add = fadd double 0.000000, 0.000000, exceptions=ignore %_res = fcmp une double %_add, 0.000000 ret i1 %_res } => define i1 @foo() noread nowrite nofree willreturn denormal-fp-math=positive-zero,positive-zero { ret i1 1 } Transformation seems to be correct!
Alive 2 says "ERROR: Couldn't prove the correctness of the transformation"...
https://alive2.llvm.org/ce/z/GPLj4Z
And from the semantic of the case, %_add not equal to 5.054290e-321 should be wrong, (1.264810e-321 + 3.789480e-321 == 5.054290e-321), so we should expect false here?
The online version is outdated, sorry.
And from the semantic of the case, %_add not equal to 5.054290e-321 should be wrong, (1.264810e-321 + 3.789480e-321 == 5.054290e-321), so we should expect false here?
%_add = #x00000000000003ff
Which is a subnormal, so per the function attribute it is changed to +0.0.
Hi @nlopes, thanks for providing the useful info. However I am still not very clear about how to deal with our internal failure after this patch.
define i1 @foo() denormal-fp-math=positive-zero,positive-zero { %_add = fadd double 0.000000, 0.000000, exceptions=ignore %_res = fcmp une double %_add, 0.000000 ret i1 %_res } => define i1 @foo() noread nowrite nofree willreturn denormal-fp-math=positive-zero,positive-zero { ret i1 1 }
The alive result is very confusing. I don't understand why 0.000000 + 0.000000 != 0x000000 when denormal-fp-math=positive-zero, could you help to explain?
I know you said online Alive2 is outdated, but seems the online Alive2 gets opposite result for the above 0.000000 case, it verifies ret i1 0 as the valid transformation. https://alive2.llvm.org/ce/z/PjhR3U
There is a C case too:
int main(void) { double a = 1.264810e-321; double b = 3.789480e-321; return (a + b != 5.054290e-321); }
clang 1.c -Ofast -fdenormal-fp-math=positive-zero without this patch, it gets 0 and with this patch, it gets 1. Our internal test expects 0 here.
I also tested above case with XLC(-Ofast -qnostrict)/GCC(-Ofast -funsafe-math-optimizations) on PowerPC, they both get 0.
Could you please tell me what's wrong with our internal failure? Thanks in advance. @nlopes @dcandler
True, the output isn't great (floats are truncated, hence the 0.000000, which is not what's underneath). But what matters is the final result.
I know you said online Alive2 is outdated, but seems the online Alive2 gets opposite result for the above 0.000000 case, it verifies ret i1 0 as the valid transformation. https://alive2.llvm.org/ce/z/PjhR3U
The online version if Alive2 doesn't implement the denormal-fp-math attribute.
There is a C case too:
int main(void) { double a = 1.264810e-321; double b = 3.789480e-321; return (a + b != 5.054290e-321); }clang 1.c -Ofast -fdenormal-fp-math=positive-zero without this patch, it gets 0 and with this patch, it gets 1. Our internal test expects 0 here.
Look at the generated assembly with -O0 without and without -fdenormal-fp-math. There's no difference. So it seems that this flag doesn't guarantee anything (it's best effort) or it's not fully implemented yet.
Nevertheless, your internal test is wrong. Check the math here: https://en.wikipedia.org/wiki/Double-precision_floating-point_format#Exponent_encoding
Thanks. I need some time to have a better understanding.
So GCC/XLC both returning 0 for the C case is caused by -fdenormal-fp-math=positive-zero not implemented or not used in the command line? I tested with clang, without -fdenormal-fp-math=positive-zero, it also returns 0 with this patch.
Catching up after being out for a few days...
Yes, we'll need to make more of FP constant-folding aware of these function attributes to get consistent results.
D128647 looks like it will fix fcmp. We'll need something similar for constant-folded FP intrinsics and libcalls too. And there was a comment about updating LangRef to document the behavior (the flushing mode does not affect signbit ops like fneg/fabs/copysign).
fneg is not a computational FP operation; it's a signbit operation. For example on x86 with SSE, it's implemented with a vector integer xor instruction, so it is not affected by denorm FP mode. I'm not sure what happens on targets that have a real fneg instruction.
Either way, we need at least one test to check the behavior.