Page MenuHomePhabricator

[ConstantFolding] Respect denormal handling mode attributes when folding instructions
Needs ReviewPublic

Authored by dcandler on Jan 10 2022, 9:15 AM.

Details

Summary

Depending on the environment, a floating point instruction should
treat denormal inputs as zero, and/or flush a denormal output to zero.
Denormals are not currently accounted for when an instruction gets
folded to a constant, which can lead to differences in output between
a folded and a unfolded instruction when running on the target. The
denormal handling mode can be set by the function level attribute
denormal-fp-math, which this patch uses to determine whether any
denormal inputs to or outputs from folding should be zero, and that
the sign is set appropriately.

Diff Detail

Unit TestsFailed

TimeTest
60,090 msx64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -shared-libasan -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxDynamicConfig/TestCases/Output/scariness_score_test.cpp.tmp
60,110 msx64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/scariness_score_test.cpp.tmp
60,050 msx64 debian > Clang.CodeGenCXX::dllimport-members.cpp
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/15.0.0/include -nostdsysteminc -no-opaque-pointers -no-enable-noundef-analysis -disable-llvm-passes -triple i686-windows-msvc -fms-compatibility -emit-llvm -std=c++1y -O0 -o - /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGenCXX/dllimport-members.cpp -DMSABI | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=MSC --check-prefix=M32 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGenCXX/dllimport-members.cpp
60,490 msx64 debian > Clang.Driver::fsanitize.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -target x86_64-linux-gnu -fsanitize=undefined -fsanitize-trap=undefined /var/lib/buildkite-agent/builds/llvm-project/clang/test/Driver/fsanitize.c -### 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/Driver/fsanitize.c --check-prefix=CHECK-UNDEFINED-TRAP
60,110 msx64 debian > Clang.OpenMP::target_teams_distribute_parallel_for_simd_codegen_registration.cpp
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/15.0.0/include -nostdsysteminc -no-opaque-pointers -verify -fopenmp -fopenmp-version=45 -x c++ -triple powerpc64le-unknown-unknown -fopenmp-targets=powerpc64le-ibm-linux-gnu -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen_registration.cpp -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --allow-unused-prefixes /var/lib/buildkite-agent/builds/llvm-project/clang/test/OpenMP/target_teams_distribute_parallel_for_simd_codegen_registration.cpp
View Full Test Results (8 Failed)

Event Timeline

dcandler created this revision.Jan 10 2022, 9:15 AM
dcandler requested review of this revision.Jan 10 2022, 9:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 10 2022, 9:15 AM

Does "original operation" mean what hardware would do? Not all targets support flushing denormals to zero. Even within X86, SSE can flush denormals to zero, but X87 can't.

Thanks for highlighting that. Producing the appropriate result for the hardware was what I meant, so based on that I would have to rework this to handle different targets.

It doesn't look like constant folding currently has such target knowledge, and the only obvious solution I can see would be to use TargetTransformInfo to determine if the target should flush denormals from folding. However this would also mean passing TTI around much more in order to pass it down through to the constant folding functions wherever they get used.

lenary added a subscriber: lenary.Jan 13 2022, 4:31 AM
dcandler updated this revision to Diff 401607.Jan 20 2022, 5:15 AM
dcandler added a reviewer: nikic.

I've updated the patch to use TargetTransformInfo to determine whether the target supports flushing to zero. I'm slightly wary, as there is an warning discouraging using TargetTransformInfo, but I unsure of an alternative.

lenary added inline comments.Jan 20 2022, 5:22 AM
llvm/lib/Analysis/ConstantFolding.cpp
1004

You should probably also explain that the return nullptr means if you don't have a TTI, you are explicitly preventing any constant folding of operations that produce denormals, rather than continuing to fold to the denormal value.

dcandler updated this revision to Diff 403669.Jan 27 2022, 8:30 AM

Ping for any other comments.

I've also added the suggested comment in the code. And yes, one result of this patch would indeed be that calls to constant folding from passes other than instruction combining would not be able to fold floating point instructions that result in a denormal - since they lack TTI. While it would be possible to add TTI to those passes, that becomes a significantly larger change and I was unsure how necessary that would be if instruction combining can already handle this case.

nikic requested changes to this revision.Jan 29 2022, 1:27 AM
nikic added a reviewer: lebedev.ri.

Definitely not a fan of this change, on multiple levels. I don't particularly look forward to having a TTI dependency in constant folding, but more immediately, I don't think I really understand the LangRef basis for this change.

Could you explain which constituent part of fast FMF specifically allows this optimization? I don't see anything related to denormals, and interpreting afn to apply to primitive operations would be a bit of a stretch.

How does this relate to the denormal-fp-math function attribute? I would have thought that this is the one that controls denormal flushing behavior.

This revision now requires changes to proceed.Jan 29 2022, 1:27 AM

Sorry, I didn't look at this review sooner, but I agree with @nikic. The TTI warning comment in instcombine was supposed to prevent this kind of proposal, and this seems to be mixing up seemingly unrelated pieces of FP behavior. It might help to see a source/motivating example to understand if there's any realistic solution to the problem.

Thanks for taking a look, it does appear I misunderstood a few things.

The original motivation was that downstream we found a case where the output of the compiled program is different between O0 and O1: at O0 a floating point instruction executes on the target and produces a zero, but at O1 the instruction gets constant folded to a denormal value. So I have been exploring whether it would be possible to handle denormals at the time of folding, since this occurs before other combination passes (e.g DAGCombiner).

The denormal-fp-math attribute does contains the relevant information about the floating point envionment, and should already be accessible during folding via the instruction pointer (assuming it belongs to a function at the time). So I believe I could potentially rework this to avoid pulling in target info by checking denormal-fp-math instead, if that would be more acceptable. Reading the language reference, the attribute doesn't mandate flushing denormals to zero, but does suggest inputs should be treated as zero, which constant folding also does not currently respect. On testing, this can lead to similar differences in ouput when folding a floating point instruction where one input is a denormal, so it may make sense to check the inputs as well as the output in ConstantFoldInstOperands.

Thanks for taking a look, it does appear I misunderstood a few things.

The original motivation was that downstream we found a case where the output of the compiled program is different between O0 and O1: at O0 a floating point instruction executes on the target and produces a zero, but at O1 the instruction gets constant folded to a denormal value. So I have been exploring whether it would be possible to handle denormals at the time of folding, since this occurs before other combination passes (e.g DAGCombiner).

The denormal-fp-math attribute does contains the relevant information about the floating point envionment, and should already be accessible during folding via the instruction pointer (assuming it belongs to a function at the time). So I believe I could potentially rework this to avoid pulling in target info by checking denormal-fp-math instead, if that would be more acceptable. Reading the language reference, the attribute doesn't mandate flushing denormals to zero, but does suggest inputs should be treated as zero, which constant folding also does not currently respect. On testing, this can lead to similar differences in ouput when folding a floating point instruction where one input is a denormal, so it may make sense to check the inputs as well as the output in ConstantFoldInstOperands.

If we can use the function attribute to get the desired result, I think that would be fine.
In an ideal world, we would have all of the FP settings in one place, but FMF became part of the bonus bits in an instruction, and there's not enough space there to represent variations like denorm or sqrt specializations.
If the attribute is not specified as needed, then we should clarify/enhance that in LangRef.

dcandler updated this revision to Diff 418604.Mar 28 2022, 8:52 AM
dcandler retitled this revision from [ConstantFolding] Flush folded denormals to zero when using fastmath to [ConstantFolding] Respect denormal handling mode attributes when folding instructions.
dcandler edited the summary of this revision. (Show Details)

I've updated the patch with a new version which now takes the denormal handling mode from the function attribute, and adjusted the title/summary to reflect this. This supports both different settings for inputs and outputs to the instruction, as well as whether values get flushed positive zero or the sign is preserved.

While testing I found the denormal-fp-math-f32 attribute was being set unexpectedly, but I've created a separate patch to deal with that: https://reviews.llvm.org/D122589. It should not be an issue for this patch however, since it simply uses the attributes as given.

Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2022, 8:52 AM
Herald added a subscriber: StephenFan. · View Herald Transcript

ConstantFoldBinaryOpOperands has other callers; are you planning to fix each of them separately?

Assuming IEEE denormal handling if we can't find the parent Function seems a bit dubious, but maybe it's the best we can do for now. It's not clear to me how this interacts with floating-point constant expressions (e.g. ConstantExpr::getFAdd). I guess we could just kill off floating-point constant expressions, since they aren't really useful in practice, but that's a non-trivial effort.

When I originally checked the other calls to ConstantFoldBinaryOpOperands did not look like they would potentially be handling floating point instructions, although on second look, I missed InstructionSimplify::foldOrCommuteConstant. The same approach should work there too though, so I can expand the patch to cover that usage.

If the instruction lacks a parent function, then the alternative to defaulting to IEEE would be not folding at all. The impact to that could be limited to only instructions to where a denormal is detected in the input/output; if there's no denormal, there's no need for a parent function so folding can proceed.

Constant expressions won't be currently be affected by this, although potentially they could also follow the principle of only getting folded if denormals are not involved.

If the instruction lacks a parent function, then the alternative to defaulting to IEEE would be not folding at all. The impact to that could be limited to only instructions to where a denormal is detected in the input/output; if there's no denormal, there's no need for a parent function so folding can proceed.

Maybe... refusing to fold only when we see a denormal might lead to bugs which only show up for specific constants, though. I think the current approach of just continuing to do IEEE folding is fine as an incremental step.

Constant expressions won't be currently be affected by this, although potentially they could also follow the principle of only getting folded if denormals are not involved.

My concern with constant expressions here is mostly optimizations turning instructions into constant expressions, e.g. TargetFolder/ConstantFolder. If frontends create constant expressions, it's less important what happens.

spatel added inline comments.Mar 30 2022, 11:21 AM
llvm/lib/Analysis/ConstantFolding.cpp
1002

fneg is not a computational FP operation; it's a signbit operation. For example on x86 with SSE, it's implemented with a vector integer xor instruction, so it is not affected by denorm FP mode. I'm not sure what happens on targets that have a real fneg instruction.

Either way, we need at least one test to check the behavior.

dcandler updated this revision to Diff 421535.Apr 8 2022, 8:23 AM

I've removed FNeg from the changes; indeed it wasn't affected by denormal mode wherever I tried. That did allow me to refactor slightly to wrap around just ConstantFoldBinaryOpOperands, and fall back to that when dealing with a constant expression or functionless instruction just as before. So for the moment the denormal mode information is only applied in situations where it is available, which is one step forward at least.

Please add a testcase for opt -instsimplify.

llvm/lib/Analysis/InstructionSimplify.cpp
612

Maybe we should just return early if CxtI is null, instead of falling back to ConstantFoldBinaryOpOperands?

Please add a testcase for opt -instsimplify.

Right - I don't think we want any -instcombine tests with this patch. It should be completely testable from -instsimplify. And we should vary the opcodes (not just fmul), so we have at least partial coverage for each one (plus a negative test for "fneg").

arsenm added a subscriber: arsenm.Apr 18 2022, 1:37 PM
arsenm added inline comments.
llvm/lib/Analysis/ConstantFolding.cpp
1327

Don't need llvm::

llvm/test/Transforms/InstCombine/AArch64/constant-fold-fp-denormal.ll
3 ↗(On Diff #421535)

Don't need a specific target here

4 ↗(On Diff #421535)

Would it be helpful to have some constantexpr cases in a global initializer?

dcandler updated this revision to Diff 426602.Tue, May 3, 1:51 AM

I've moved the tests to instsimplify, and expanded them out to cover more cases and additional instructions. While there may be some overlap, this structures it a bit better and ensures cases don't get conflated: depending on the instruction some zero results can be obtained from either the input getting zeroed or the output getting zeroed, so it's better to test both separately.

dcandler marked 3 inline comments as done.Tue, May 3, 1:56 AM
dcandler added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
612

Returning early here would change the result for existing cases where there is no instruction pointer, so constant expressions won't get folded where they would before.

llvm/test/Transforms/InstCombine/AArch64/constant-fold-fp-denormal.ll
4 ↗(On Diff #421535)

I don't think it's necessary to test those if they aren't going to be affected by setting the function attribute.

spatel added a comment.Wed, May 4, 7:33 AM

I don't understand why we have (duplicate?) tests with -instcombine. Also, why are there ARM and AArch test file variants? IIUC, the target makes no difference - the behavior is completely specified by the function attributes.
I like that we're testing each opcode with each attribute combination for thoroughness, but I'd prefer to have that all in one file rather than split by opcode. Wouldn't it be easier to see the progression if the tests were ordered based on the attributes rather than opcode? Ie, if we're testing that an input is flushed to zero, then that denorm constant could be repeated N times in a row independently of the opcode.

Once we have the right set of tests in place, you can pre-commit them with baseline CHECK lines, and then it will be easier to see how this patch changes functionality (and we can add test comments to explain more if needed).

Sorry, the instcombine tests are from the previous version and shouldn't have been included in the diff.

Originally I put the tests in ARM/AArch64 because those seemed the relevant ones where you'd expect to see the attributes with all the different modes, but you're right; having the tests in both is redundant, and no target is needed at all really when the test is specifying the attributes. So I will combine the opcodes into one file, and can move them down a folder.

On ordering, while one set of inputs would work for multiple attributes/opcodes when testing the inputs are correctly flushed, testing that the output is flushed requires specific inputs for each opcode/attribute combination to produce a subnormal output. Where possible, I tried to pick input values relevant to the opcode where one set of inputs produced different results based on the attribute and grouped based on that, since then the effect of the attribute is visible at a glance. For example with fadd, the same inputs and opcode can produce four different results depending on which attribute is used, and the result of the input getting flushed is distinct from the result when the output is flushed. Keeping those tests together felt more readable than continually changing the inputs to order by attribute first.

dcandler updated this revision to Diff 430100.Tue, May 17, 9:49 AM

Tests moved out and pre-committed in https://reviews.llvm.org/D125807