This is an archive of the discontinued LLVM Phabricator instance.

Add constrained fptrunc and fpext intrinsics
ClosedPublic

Authored by kpn on Dec 19 2018, 12:20 PM.

Diff Detail

Event Timeline

kpn created this revision.Dec 19 2018, 12:20 PM
kpn added a comment.Jan 2 2019, 9:03 AM

I just realized I missed a couple of changes. Let me work on those and I'll update later.

kpn updated this revision to Diff 182019.Jan 16 2019, 5:44 AM

It was pointed out before I split this ticket out that I wasn't properly passing the correct operand in ExpandNode() when handling the two new STRICT nodes. I've corrected that.

cwabbott added inline comments.
docs/LangRef.rst
14529

Hi,

I've been working on implementing a Vulkan extension which allows the user to specifiy different rounding modes for the AMDGPU backend. I'm not sure how this works in C/C++, but we're required to support floating-point truncation with non-standard rounding modes. Is there a reason the rounding mode isn't an argument here?

kpn marked an inline comment as done.Jan 25 2019, 9:24 AM
kpn added inline comments.
docs/LangRef.rst
14529

Going back and rereading D43515, I don't see an explicit reason given back then. And I can't find anything in the C99 or IEEE 754 standards, or in the LLVM documentation, that would mandate any particular rounding mode. So I'm open to adding a rounding mode argument to the constrained fptrunc.

Andrew? What do you think of making constrained fptrunc go back to taking a rounding mode argument?

Having said that, the constrained FP intrinsics are to avoid optimizations that change program behavior taking traps into account. Is this the behavior you need for Vulcan?

cwabbott added inline comments.Jan 28 2019, 7:30 AM
docs/LangRef.rst
14529

Vulkan doesn't support trapping floating-point exceptions, so we don't have to worry about that. However, my understanding is that we still need to communicate the rounding mode to LLVM, to prevent it from constant folding floating-point operations with the wrong rounding mode, so we still need to use the intrinsics.

There's also the complication that we may need to emit some "internal" floating-point operations which need to have some defined rounding mode different from what the user specified. The backend has total control over the control register that specifies the rounding mode, and in some cases (e.g. fp32 -> fp16 truncation) the rounding mode is actually specified statically by the instruction itself rather than the control register, so my thought was that we could make AMDGPU just always use the rounding specified by the argument to the constrained intrinsic, emitting changes to the control register if necessary. I'm not sure if the CodeGen infrastructure is set up to do that.

cwabbott added inline comments.Jan 28 2019, 9:16 AM
docs/LangRef.rst
14529

Oh, and I forgot another thing: the extension also adds support for letting the user either flush or preserve denormalized values. However, this is per-source-module, and sometimes we need to stitch together multiple source modules which have different rounding needs, emitting an instruction in between them to change the denorm flushing and/or rounding mode. So it seems we really do need to use the constrained intrinsics, to prevent code motion of floating-point operations around that register setting.

kpn added a comment.Jan 29 2019, 7:18 AM

OK, I'm working on it now.

kpn updated this revision to Diff 184127.Jan 29 2019, 11:02 AM

Add back a rounding mode argument to constrained fptrunc as requested by Connor Abbott.

docs/LangRef.rst
14529

We should definitely have a rounding mode argument for fptrunc. I think the reason we missed that the first time around is probably due to the unfortunate naming of this operation (i.e. it isn't actually truncating) and the confusion with ISD::ROUND.

This is looking pretty good. I don't think I know the Selection DAG well enough to offer a proper review of that. I'll see if I can get Craig's attention on it.

docs/index.rst
194

This seems a bit too prominently placed. Most people don't care about these intrinsics. I would recommend sinking this down into the subsytem documentation section. There is no clear organization there, so it's hard to say where it should go. Maybe just above or below the exception handling section (just based on my perception of the generality of each).

lib/IR/Verifier.cpp
4659

The formatting is non-standard and inconsistent in this section.

4674

You could combine these two lines as:

if (auto *VecTy = dyn_cast<VectorType>(OperandTy))

4685

Redefining "Operand" here makes the code confusing. I'd rather see Operand and Result as separate variables and make local variables for their types. I would also make the Assert statements more specific to what they are actually checking. For instance,

if (OperandTy->isVectorTy()) {

Assert(ResultTy->isVectorTy(), ...
Assert(OperandTy->getVectorNumElements() == ResultTy->getVectorNumElements(), ...

}

Is there a reason that these vector checks are specific to fptrunc and fpext? Is it just because they don't have a "same type" restriction in the intrinsic definition?

The floating point type assertions could be simplified as

Assert(OperandTy->isFPorFPVectorTy(),...
Assert(ResultTy->isFPorFPVectorTy(),...

You should also be checking that OperandTy->getScalarSizeInBits() > ResultTy->getScalarSizeInBits for fptrunc and vice versa for fpext.

craig.topper added inline comments.Jan 31 2019, 2:19 PM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711

What makes us reach this case? I would expect we'd scalarize based on the result type before we got to the operand type.

3059

We can't really widen this can we? Won't that put garbage in the upper elements?

craig.topper added inline comments.Jan 31 2019, 2:19 PM
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll
2566

Please add an AVX command line so v4f64 will be a legal type.

kpn marked 6 inline comments as done.Feb 5 2019, 9:46 AM
kpn added inline comments.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711

Test case CodeGen/AArch64/neon-fpround_f128.ll says this code is needed. That case goes through the non-STRICT version of this function. This STRICT function was copied from the non-STRICT function and called in the appropriate places alongside that function. And a trivial conversion of the test case to use the constrained intrinsics does indeed go through this STRICT function.

I did want to try to keep the functions unified, but sometimes the result was too ugly to live.

3059

Is that what getUNDEF() does? Give llvm license to put garbage in registers? My assumption is that the code is fine because this function is a copy of WidenVecRes_Convert() with the needed changes for the strict node being chained. If if there's a problem here there's also a problem in that function.

lib/IR/Verifier.cpp
4674

The rewrite moots this point.

4685

I've rewritten this code and with your suggestions it does look much nicer.

I think I did put the vector checks are for fptrunc and fpext because they aren't checked earlier.

I've also added the checks for the appropriate changes in ScalarSizeInBits.

kpn updated this revision to Diff 185337.Feb 5 2019, 9:48 AM

Address review comments.

craig.topper added inline comments.Feb 5 2019, 10:19 AM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3059

That is what the undef means. The existing code isn't required to be exception safe so garbage is fine. The constrained intrinsics have to be exception safe. This is why the implementation of WidenVecRes_StrictFP is different than the non-trapping case in WidenVecRes_BinaryCanTrap. I believe FADD/FSUB/FDIV/FMUL are considered non-trapping on most targets.

kpn marked an inline comment as done.Feb 8 2019, 10:11 AM

I'm seeing a regression in WebAssembly/PR40267.ll that I need to look into. If anyone else is seeing this failure let me know.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3059

It looks like this "WidenNumElts % InVTNumElts" block can just be eliminated. The fallback at the end of the function should handle the case without any undefs.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2846

Would this lowering to EmitStackConvert(...) discard the rounding mode? I'm not familiar with this code, but I *think* it would.

E.g. STRICT_FP_ROUND would lower to a truncating store.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711

Have you considered scalarizing the operands with a generic function? Something like ScalarizeVecRes_StrictFPOp(...)? I suspect we'll see reuse of this code.

@craig.topper

What makes us reach this case? I would expect we'd scalarize based on the result type before we got to the operand type.

The different operand and result types for these operations are probably why ScalarizeVecRes_StrictFPOp(...) didn't trip. Just guessing though...

kpn marked 2 inline comments as done.Feb 18 2019, 11:31 AM
kpn added inline comments.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2846

Yes it would, assuming EmitStackConvert() works correctly for floating point types. A 'make check' of llvm with the default targets enabled doesn't trigger the call to DAG.getTruncStore(), so I'm not actually sure it works.

But none of the other intrinsics handle rounding modes currently, so I think we should leave this to future work.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
711

You know, I thought about it. But trying to generalize it would hide the parallel with ScalarizeVecRes_FP_ROUND(). Right now the connection between the two is obvious. I'm not sure that we'd gain much at this point in time by generalizing and losing that bit of readability.

If we do end up needing to generalize this code in the future then we can do it then.

kpn updated this revision to Diff 187265.Feb 18 2019, 11:34 AM

Address review comments.

Fix regression on WebAssembly accidentally caused by incorrect svn update.

jsji added a subscriber: jsji.Feb 25 2019, 10:42 AM
kpn updated this revision to Diff 191120.Mar 18 2019, 10:04 AM

Rebase. Ping.

craig.topper added inline comments.Mar 25 2019, 12:24 PM
docs/LangRef.rst
14511

This should be in a separate patch.

include/llvm/IR/Intrinsics.td
592

This line [ should probably be lined up with the line above. Same with fpext.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1784

This line doesn't make sense to me. This is replacing the users of result 1 of the load you just created with Chain. But no one has seen this load yet so how can its result 1 have any users?

2831

Shouldn't you be passing the input chain operand of Node into the last argument? Right now it looks like you're using the Chain result from Node itself. But we want to delete Node.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
713

Assert that OpNo is 1.

1530

use a temporary for the OpNo instead of of repeatedly calling N->isStrictFPOpcode()

1933

This comment must have been copy pasted. We aren't operating on a "load" here

4024

Add curly braces to the else and fix the indentation

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7682

Can this be Node->getValueType(0)? Accessing ValueList directly seems pretty unusual.

lib/IR/Verifier.cpp
4630–4712

Most of this looks like it belongs in a separate patch. This patch shoudl focus on adding the new intrinsics. Gaps in old intrinsics should be separated.

test/CodeGen/X86/vector-constrained-fp-intrinsics.ll
3

Do this as a pre-commit?

kpn marked an inline comment as done.Mar 26 2019, 7:55 AM
kpn added inline comments.
docs/LangRef.rst
14511

What should be in a separate patch? The new AddingConstrainedIntrinsics.rst file?

kpn marked an inline comment as not done.Mar 26 2019, 7:56 AM
kpn marked 4 inline comments as done.Apr 15 2019, 5:56 AM
kpn marked 6 inline comments as done.Apr 17 2019, 9:47 AM
kpn added inline comments.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
1784

Agreed. I'm fixing it now.

2831

I'm now changing EmitStackConvert to splice itself into the chain. So passing Node is about to become correct since I need both ends of the chain to do the splicing.

kpn updated this revision to Diff 196690.Apr 25 2019, 11:18 AM

Address review comments.

I did find that I didn't need a new, special version of ReplaceNode(). One of the existing ones appears to do the job. That meant I could simplify the changes to EmitStackConvert() and avoid chain splicing there. This, in turn, means I only needed to pass in one end of the chain to EmitStackConvert(). I hope the result is more readable.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Can this cause the wrong rounding mode to be used? For X86 targets I would guess it will result in an instruction that uses the runtime rounding mode, but I'm not sure we can count on that for all targets. Also, if the value being converted is a constant, might this get folded using the default rounding mode? And if it is a constant and we knew the rounding mode based on an argument to the intrinsic, we might want to fold it (though ideally that would have happened before this).

kpn added inline comments.Apr 26 2019, 7:56 AM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Isn't this an argument for a strict load node types and maybe a strict store type as well?

I'd be surprised if we were doing constant folding of a store+load combination, but if we are then either a strict load or store node would disable that folding.

Strictly speaking, using the runtime rounding mode would be incorrect if that wasn't what was specified in the rounding mode field of the intrinsic. So even X86 could be wrong.

We aren't doing anything with the rounding and exception arguments yet for any of the new intrinsics. So how about I file a bug noting that fact and noting that we need to not forget about this case? Then we can move on and come back later.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Actually, if the runtime rounding mode doesn't match the rounding argument for non-dynamic cases that's a user error or a bug somewhere upstream in the compiler. The rounding mode argument is supposed to tell us what the rounding mode is at this point in the program. It is not supposed to control the rounding mode. The non-dynamic rounding modes are essentially an "assume" kind of directive.

I see that the constant folding case I'm concerned about will actually be covered by one of your test cases, so I think we can safely move forward with this the way you have it.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
1530

I'd prefer to see this as:

unsigned OpNo = N->isStrictFPOpcode() ? 1 : 0;

1921

Again, the use of a boolean result as an index is awkward.

test/CodeGen/X86/fp-intrinsics.ll
303

This comment is wrong.

test/CodeGen/X86/vector-constrained-fp-intrinsics.ll
3

Expanding on Craig's comment here. I think he is suggesting that you add the AVX run line and all of the associated new checks as a separate patch before the fpext/fptrrunc patch lands.

kpn updated this revision to Diff 197134.Apr 29 2019, 9:49 AM
kpn marked 5 inline comments as done.

Address review comments.

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7664

It might be helpful to have a comment here explaining why STRICT_FP_ROUND isn't unary. That can be done in a separate change if everything else here is ready to commit. It all looks good to me, but I'm hoping Craig can give your widen vector changes one last look to make sure that's doing what he expected.

craig.topper added inline comments.Apr 29 2019, 12:12 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2835

This break is unreachable.

2850

This break is unreachable

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3042

I'm not sure this is safe. There's no way of knowing how the input was widened. I don't think you can access more than the original vector width worth of elements.

3054

I don't think this is safe either.

3995

Again I don't think this is safe. We don't what's in widened elements of the input.

test/CodeGen/X86/vector-constrained-fp-intrinsics.ll
2487

I think this is wrong. It's reading 4 elements, but we don't know what is in the 4th element.

2558

This is reading 4 float elements from memory. We should only be reading 2.

2584

This reads 4 floats from memory, but we should only read 3.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Can this cause the wrong rounding mode to be used? For X86 targets I would guess it will result in an instruction that uses the runtime rounding mode, but I'm not sure we can count on that for all targets.

I think it's more severe than that. The details are not fresh in my mind, but IIRC EmitStackConvert(...) emits a straight truncate on X86 (well, at least for one case I looked at. Could be wrong too.). We probably shouldn't be calling it for a strict round, and rather go to an instruction that honors rounding mode.

kpn marked 2 inline comments as done.May 1 2019, 10:06 AM
kpn added inline comments.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

My impression is that EmitStackConvert() is one of those functions that is only used as a last resort when an ISA lacks a good reg+reg instruction. So this is probably already handled for most cases.

If you know of a way to write a strict FP test case that uses it I'd love to see it. And if you have one then we should probably turn it into a library call to honor the rounding mode.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3042

I believe you are correct. I'm looking at this now.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

@cameron.mcinally , what do you mean by "a straight truncate"?

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Nope, I made a mistake. EmitStackConvert(...) ends up creating a truncstore, which could end up as an FST. I had mistakenly thought that this rounded to nearest, but it looks like it uses the FPU control word.

Assuming that fenv.h sets the FPU control word correctly (I didn't check), it should be fine.

andrew.w.kaylor added inline comments.May 1 2019, 1:14 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2829

Regarding fenv.h, if you use the fesetround() function from fenv.h to set the rounding mode, it will update both MXCSR and FPCW. If you use an intrinsic or inline assembly or whatever to set one of them without the other, you're on your own (and fegetround() will fail if they have different settings).

kpn updated this revision to Diff 198856.May 9 2019, 9:42 AM

Address review comments: Add a comment as requested. Remove bogus optimizations to let the base case work. Update tests.

kpn updated this revision to Diff 198890.May 9 2019, 12:56 PM

Why does Verifier::visitIntrinsicCall() not choke when given an intrinsic it doesn't know about?

I added back the lines accidentally lost so the new intrinsics do now get checked.

kpn added a subscriber: kbarton.May 10 2019, 12:28 PM
craig.topper added inline comments.May 10 2019, 12:51 PM
include/llvm/IR/Intrinsics.td
672

A lot of this FIXME still applies doesn't it?

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3032

I think this is an unused variable

lib/IR/Verifier.cpp
4627

Indented too far

4629

Indented too far

include/llvm/IR/Intrinsics.td
672

We definitely need FCMP. I attempted a patch for this, D54649, but FCMP is Custom lowered on X86. We still need a good way to handle STRICT nodes that need to be Custom lowered.

Also FYI that the line below can be removed. We decided not to implement constrained FABS and FCOPYSIGN... unless a problem is found.

kpn updated this revision to Diff 199074.May 10 2019, 1:30 PM
kpn marked 4 inline comments as done.

Address review comments.

craig.topper accepted this revision.May 10 2019, 1:34 PM

LGTM with that one remaining unused variable fixed.

lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3031

Is InWidenVT also unused?

This revision is now accepted and ready to land.May 10 2019, 1:34 PM
kpn marked an inline comment as done.May 10 2019, 1:43 PM

Excellent. Thank you for all the reviews! I will commit this on Monday morning so I can keep an eye on the bots.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 13 2019, 6:21 AM