This is an archive of the discontinued LLVM Phabricator instance.

IR: Add llvm.ldexp and llvm.experimental.constrained.ldexp intrinsics
ClosedPublic

Authored by arsenm on Nov 4 2015, 2:54 AM.

Details

Summary

AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.

Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.

Diff Detail

Event Timeline

nhaehnle updated this revision to Diff 39178.Nov 4 2015, 2:54 AM
nhaehnle retitled this revision from to Add llvm.ldexp.* intrinsic, associated SDNode and library calls.Nov 4 2015, 2:54 AM
nhaehnle updated this object.
arsenm added a comment.Nov 4 2015, 9:34 AM

This mostly LGTM except for the question of error behavior. There should be a few additions to get more of the benefits of using an intrinsic over a libcall. ldexp should be added to isTriviallyVectorizable and isSafeToSpeculativelyExecute with appropriate tests, assuming we can assume it doesn't set errno. This could be a follow up patch.

docs/LangRef.rst
9889

I don't think this should be defined it to handling the same way as libm. I think we should say it does not set errno, and then to only do the libcall transformation if the call is marked readonly/readnone. This is an area that isn't handled particularly consistently by the existing math intrinsics.

test/CodeGen/AMDGPU/llvm.ldexp.ll
21

Should include vector versions for at least v2f32, v4f32 and v2f64

Also, can you merge the existing llvm.AMDGPU.ldexp.ll test into this one and rename them with a legacy_ prefix

arsenm added a subscriber: hfinkel.Nov 4 2015, 9:35 AM
nhaehnle updated this revision to Diff 39403.Nov 5 2015, 12:59 PM

Thank you for taking a look! I've made some changes based on your feedback:

  • AMDGPU: more llvm.ldexp.ll tests and assorted bugfixes
  • LangRef for llvm.ldexp.*: remove statement about handling error conditions
  • [VectorUtils] llvm.ldexp.* intrinsic is vectorizable
  • [ValueTracking] ldexp preserves the sign of its first argument

I agree that the error handling is a problem, and I have to admit that I don't
know what is best. At the time of the libcall transformation, we already have
an SDNode, so I do not know how to tell the attributes of the original call.

It's also some effort to provide an expansion that is guaranteed to never set
errno, because the most straightforward expansion uses exp2, which is in turn
likely to become a library call. I suppose one could write a custom implementation
in compiler-rt, but I don't think that that's the best use of my time.

For now, I have made changes that are in line with the other intrinsics like pow
and powi: those are marked as isTriviallyVectorizable, but *not* as
isSafeToSpeculativelyExecute.

I hope that this is good enough. There are quite a number of TODOs already in
the code regarding these error problems. In any case, I've left those changes
as separate commits locally, so it's easy enough for me to rearrange them.
(Though at least for some of them I believe they should definitely be squashed
before committing to SVN.)

Couldn't the original bug be fixed by marking ldexpf as unavailable for AMDGPU in lib/Analysis/TargetLibraryInfo.cpp ?

Couldn't the original bug be fixed by marking ldexpf as unavailable for AMDGPU in lib/Analysis/TargetLibraryInfo.cpp ?

I think so, yes. Though Matt said that we do want to use the ldexp instruction because it is a full-rate instruction.

Couldn't the original bug be fixed by marking ldexpf as unavailable for AMDGPU in lib/Analysis/TargetLibraryInfo.cpp ?

I think so, yes. Though Matt said that we do want to use the ldexp instruction because it is a full-rate instruction.

Ok, so for a temporary solution, rather than changing the intrinsic emitted by Mesa, I think we should mark this libcall as unavailable. This current patch could then be done as a follow up.

arsenm added inline comments.Jan 19 2016, 1:49 PM
test/CodeGen/X86/ldexp.ll
4

Vector tests here are probably a good idea as well

hfinkel added inline comments.Feb 2 2016, 5:03 PM
docs/LangRef.rst
9889

As I recall, we're very consistent about this, with one exception: @llvm.sqrt. And this causes a lot of confusion. That having been said, there is a precedent, and there are good reasons to do it. However, we do need to say what happens if the result is not representable. You really have two choices:

  1. "and handles error conditions in the same way" (i.e. perhaps sets errno)
    1. Has undefined behavior (it needs to be undefined because it might be implemented using libm, and we can't know whether libm will affect errno)
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
3251

This seems like a great idea is FEXP2 is legal, but otherwise, seems likely slower than the original library function call to ldexp. Unless we really know better, we should keep the original call.

lib/CodeGen/TargetLoweringBase.cpp
873

You should add FLDEXP here too.

nhaehnle updated this revision to Diff 47547.Feb 10 2016, 3:16 PM
nhaehnle marked 4 inline comments as done.

Rebased on top of current trunk and addressed the various comments.

Since the TargetAction now defaults to Expand (which is actually LibCall
in disguise when available), I have removed several places where targets
redundantly set the action.

I've opted to go the "undefined range error" behaviour route in the revision since that seemed more useful to me given that he LibCallSimplifier is intrinsic->intrinsic and libcall->libcall now.

arsenm edited edge metadata.Mar 2 2016, 6:54 PM

Could also use updating some IR places to handle it (e.g. TTI, isSafeToSpeculativelyExecute), but that's probably a separate patch

docs/LangRef.rst
9889

The returned value on underflow is defined to be zero, and HUGE_VAL, which may be infinity, on overflow. I think saying undefined behavior for the case is too strong. Maybe saying just the state of errno is undefined?

hfinkel added inline comments.Apr 26 2016, 6:00 PM
docs/LangRef.rst
9889

We don't have a way to model errno. We need to "prevent" a situation where we're allowed to reorder a call to ldexp in between, for example, a call to open() and a call to perror(). To get the benefits you want, however, you need to mark the function as readnone. However, it might be implemented using the underlying library call, which might set errno. Unless you make that undefined behavior, then the readnone on the intrinsic is wrong. Both overflow and underflow need to be undefined behavior. I realize that this is unfortunate.

lib/Target/PowerPC/PPCISelLowering.cpp
460

Don't do this. Set it to Expand by default (in TargetLoweringBase::initActions). That's our current best practice for new rarely-legal nodes.

arsenm added inline comments.Apr 27 2016, 1:27 PM
docs/LangRef.rst
9889

The converse is we already don't 'correctly' lower the existing intrinsics which are assumed to write errno because errno does not exist on the platform.

I'm still generally confused about the inconsistency of errno handling. Why don't we have a separate set of math intrinsics for respecting errno, and not? Lowering the non-errno version with a library call would be an incorrect lowering for these. Alternatively, why doesn't the possibility of of writing errno always be a libcall, while the intrinsics are fine for -fno-math-errno? Currently -fno-math-errno adds readnone to the call site of the library call, and allows selecting to the corresponding DAG node. The inconsistency in behavior between the DAG nodes and intrinsics has always confused me. A readnone call to the library function will select to the corresponding chainless node, which could still be lowered to a call to an errno writing function. In the case of sqrt, this is further confused because < 0 inputs are no longer undefined. I would expect the intrinsics would be the for using a native instruction which ignores errno.

The current set of math intrinsics, including those that say handle errors the same way, are already IntrNoMem (e.g. llvm.exp) and say nothing about undefined behavior. The sqrt intrinsic has undefined behavior for < 0, but we are able to fold an isnan() check before it out in the DAG. I'm not sure what an underflow/overflow test for ldexp would look like, but it would be more complicated than the simple compare and select for sqrt.

jfb added a comment.Apr 27 2016, 3:18 PM

Regarding errno: it's totally valid to ignore if the implementation sets math_errhandling & MATH_ERRNO to zero. Of course, you need to know the C library to make that choice, but its value never changes at runtime. See C11 section 7.12, as well as the soon-to-be-published C++ paper p0108r1 which you can preview here.

arsenm commandeered this revision.May 1 2023, 7:24 AM
arsenm edited reviewers, added: nhaehnle; removed: arsenm.
Herald added a project: Restricted Project. · View Herald TranscriptMay 1 2023, 7:24 AM
arsenm updated this revision to Diff 518436.EditedMay 1 2023, 7:31 AM
arsenm retitled this revision from Add llvm.ldexp.* intrinsic, associated SDNode and library calls to IR: Add llvm.ldexp and llvm.experimental.constrained.ldexp intrinsics.
arsenm edited the summary of this revision. (Show Details)

Rebase forward 7 years. Add constrained version and GlobalISel support. Fix promotion for f16->f32. Replace fpow2 based legalization with an integer expansion which actually passes opencl conformance. Drop some redundant checks for the libcall signature.

Also fix treating the second operand as a fixed scalar value instead of a vector

Herald added a project: Restricted Project. · View Herald TranscriptMay 1 2023, 7:31 AM
arsenm added a reviewer: foad.May 1 2023, 7:33 AM
kpn added a comment.May 1 2023, 8:57 AM

The constrained intrinsic version is not documented here.

tra added a subscriber: tra.May 1 2023, 11:28 AM

It would be great to add some tests for NVPTX as the patch may hit some corner cases there. NVPTX has no libcalls and fp16 support depends on the GPU variant (no fp16 before sm_60).

arsenm added a comment.May 2 2023, 5:16 AM
In D14327#4310305, @tra wrote:

NVPTX has no libcalls

The TargetLibraryInfo query says there is ldexp and then it doesn't work

arsenm updated this revision to Diff 518687.May 2 2023, 5:18 AM

Copy-paste docs like the other constrained intrinsics (is there a reason we don't just document them all as pairs?)

tra added a comment.May 2 2023, 10:54 AM

The TargetLibraryInfo query says there is ldexp and then it doesn't work

Interesting. How exactly does it fail? I'm pretty sure we used to make libcalls unavailable in the past (I think we could not lower the calls to them), but I'm having a hard time finding that code now. It may have changed when we've improved handling the unsupported libcalls in NVPTX.

foad added inline comments.May 3 2023, 9:01 AM
llvm/test/CodeGen/AMDGPU/llvm.ldexp.ll
259 ↗(On Diff #518687)

This doesn't quite work because the instruction truncates v1 to 16 bits, so if you wanted ldexp(1.0, 0x10000) aka +inf you'll actually get ldexp(1.0, 0) aka 1.0.

In D14327#4313028, @tra wrote:

The TargetLibraryInfo query says there is ldexp and then it doesn't work

Interesting. How exactly does it fail?

LLVM ERROR: Undefined external symbol "ldexpf"

llvm/test/CodeGen/AMDGPU/llvm.ldexp.ll
259 ↗(On Diff #518687)

Ugh, the library does have clamp code for this. The tablegen definition claims this is VOP_F16_F16_I32 though

arsenm updated this revision to Diff 519328.May 3 2023, 6:49 PM

Clamp when truncating exp

This mostly LGTM, but it looks like some GlobalISel legalization is missing relative to SelectionDAG?

This mostly LGTM, but it looks like some GlobalISel legalization is missing relative to SelectionDAG?

Yes. The full legalization expansion should be different, since it's possible to introduce control flow. I didn't see the point handling it right now since the only case I'm sure that expands now is x86 windows, which isn't complete enough to write an end to end test for.

nhaehnle accepted this revision.May 15 2023, 5:59 AM

Okay, makes sense.

This revision is now accepted and ready to land.May 15 2023, 5:59 AM
Joe_Nash added inline comments.
llvm/test/CodeGen/AMDGPU/llvm.ldexp.ll
5 ↗(On Diff #519328)

Typo GFX1

arsenm marked an inline comment as done.May 18 2023, 9:20 AM
arsenm updated this revision to Diff 523725.May 19 2023, 3:39 AM

Update some MC tests for operand change. Disassembler seems to have a bizarre behavior where it takes invalid instructions and prints invalid instructions with larger encodings than they started with

llvm/docs/ReleaseNotes.rst
59 ↗(On Diff #523725)

Nit: mention constrained version as well?

arsenm updated this revision to Diff 525794.May 25 2023, 2:08 PM
arsenm marked an inline comment as done.

release notes

foad added inline comments.Jun 7 2023, 6:31 AM
llvm/test/MC/Disassembler/AMDGPU/gfx10_vop3.txt
7523 ↗(On Diff #525794)

What caused this change in the assembler/disassembler behaviour? It looks like it has broken round-tripping, since the "encoding" output is longer than the input.

arsenm added inline comments.Jun 7 2023, 6:40 AM
llvm/test/MC/Disassembler/AMDGPU/gfx10_vop3.txt
7523 ↗(On Diff #525794)

The exp operand was incorrectly marked as i32 when it's really i16. The inline immediate values are then different

Joe_Nash added inline comments.Jun 7 2023, 7:00 AM
llvm/test/MC/Disassembler/AMDGPU/gfx10_vop3.txt
7523 ↗(On Diff #525794)

I believe that operand should be f16. We still want to be able to assemble inline fp constants. From a semantic point of view, these are i16 constants, but from an encoding point of view they are f16.
In the True16 support downstream I have been treating that argument as f16. If you want it to be i16 yet still support inline fp constants, we need to effectively revert 5f5f566b265db00f577ead268400d99f34ba9cdd

arsenm added inline comments.Jun 7 2023, 7:16 AM
llvm/test/MC/Disassembler/AMDGPU/gfx10_vop3.txt
7523 ↗(On Diff #525794)

It is an i16 operand. In the broken hardware handling of the f16 inline immediates, +- 0.5/1.0/2.0/4.0 are all effectively aliases for 0. The assembler now rejects these as invalid literals. I don't really understand the disassembler's handling of this invalid case

foad added inline comments.Jun 7 2023, 7:29 AM
llvm/test/MC/AMDGPU/gfx10_asm_vop2.s
12937 ↗(On Diff #525794)

The assembler now rejects these as invalid literals.

Looks like it is still accepting -4.0 here?

arsenm added inline comments.Jun 7 2023, 7:41 AM
llvm/test/MC/AMDGPU/gfx10_asm_vop2.s
12937 ↗(On Diff #525794)

It's being accepted as a 32-bit literal, which is valid on gfx10