Details
- Reviewers
andrew.w.kaylor craig.topper hfinkel mehdi_amini aemerson javed.absar - Commits
- rZORG635c65f4803c: Add constrained fptrunc and fpext intrinsics.
rZORG3cc1796eb454: Add constrained fptrunc and fpext intrinsics.
rG635c65f4803c: Add constrained fptrunc and fpext intrinsics.
rG3cc1796eb454: Add constrained fptrunc and fpext intrinsics.
rG5987749e33bb: Add constrained fptrunc and fpext intrinsics.
rL360581: Add constrained fptrunc and fpext intrinsics.
Diff Detail
Event Timeline
I just realized I missed a couple of changes. Let me work on those and I'll update later.
It was pointed out before I split this ticket out that I wasn't properly passing the correct operand in ExpandNode() when handling the two new STRICT nodes. I've corrected that.
docs/LangRef.rst | ||
---|---|---|
14836 | Hi, I've been working on implementing a Vulkan extension which allows the user to specifiy different rounding modes for the AMDGPU backend. I'm not sure how this works in C/C++, but we're required to support floating-point truncation with non-standard rounding modes. Is there a reason the rounding mode isn't an argument here? |
docs/LangRef.rst | ||
---|---|---|
14836 | Going back and rereading D43515, I don't see an explicit reason given back then. And I can't find anything in the C99 or IEEE 754 standards, or in the LLVM documentation, that would mandate any particular rounding mode. So I'm open to adding a rounding mode argument to the constrained fptrunc. Andrew? What do you think of making constrained fptrunc go back to taking a rounding mode argument? Having said that, the constrained FP intrinsics are to avoid optimizations that change program behavior taking traps into account. Is this the behavior you need for Vulcan? |
docs/LangRef.rst | ||
---|---|---|
14836 | Vulkan doesn't support trapping floating-point exceptions, so we don't have to worry about that. However, my understanding is that we still need to communicate the rounding mode to LLVM, to prevent it from constant folding floating-point operations with the wrong rounding mode, so we still need to use the intrinsics. There's also the complication that we may need to emit some "internal" floating-point operations which need to have some defined rounding mode different from what the user specified. The backend has total control over the control register that specifies the rounding mode, and in some cases (e.g. fp32 -> fp16 truncation) the rounding mode is actually specified statically by the instruction itself rather than the control register, so my thought was that we could make AMDGPU just always use the rounding specified by the argument to the constrained intrinsic, emitting changes to the control register if necessary. I'm not sure if the CodeGen infrastructure is set up to do that. |
docs/LangRef.rst | ||
---|---|---|
14836 | Oh, and I forgot another thing: the extension also adds support for letting the user either flush or preserve denormalized values. However, this is per-source-module, and sometimes we need to stitch together multiple source modules which have different rounding needs, emitting an instruction in between them to change the denorm flushing and/or rounding mode. So it seems we really do need to use the constrained intrinsics, to prevent code motion of floating-point operations around that register setting. |
Add back a rounding mode argument to constrained fptrunc as requested by Connor Abbott.
docs/LangRef.rst | ||
---|---|---|
14836 | We should definitely have a rounding mode argument for fptrunc. I think the reason we missed that the first time around is probably due to the unfortunate naming of this operation (i.e. it isn't actually truncating) and the confusion with ISD::ROUND. |
This is looking pretty good. I don't think I know the Selection DAG well enough to offer a proper review of that. I'll see if I can get Craig's attention on it.
docs/index.rst | ||
---|---|---|
194 ↗ | (On Diff #184127) | This seems a bit too prominently placed. Most people don't care about these intrinsics. I would recommend sinking this down into the subsytem documentation section. There is no clear organization there, so it's hard to say where it should go. Maybe just above or below the exception handling section (just based on my perception of the generality of each). |
lib/IR/Verifier.cpp | ||
4673 | The formatting is non-standard and inconsistent in this section. | |
4688 | You could combine these two lines as: if (auto *VecTy = dyn_cast<VectorType>(OperandTy)) | |
4699 | Redefining "Operand" here makes the code confusing. I'd rather see Operand and Result as separate variables and make local variables for their types. I would also make the Assert statements more specific to what they are actually checking. For instance, if (OperandTy->isVectorTy()) { Assert(ResultTy->isVectorTy(), ... Assert(OperandTy->getVectorNumElements() == ResultTy->getVectorNumElements(), ... } Is there a reason that these vector checks are specific to fptrunc and fpext? Is it just because they don't have a "same type" restriction in the intrinsic definition? The floating point type assertions could be simplified as Assert(OperandTy->isFPorFPVectorTy(),... You should also be checking that OperandTy->getScalarSizeInBits() > ResultTy->getScalarSizeInBits for fptrunc and vice versa for fpext. |
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll | ||
---|---|---|
3974 | Please add an AVX command line so v4f64 will be a legal type. |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
772 | Test case CodeGen/AArch64/neon-fpround_f128.ll says this code is needed. That case goes through the non-STRICT version of this function. This STRICT function was copied from the non-STRICT function and called in the appropriate places alongside that function. And a trivial conversion of the test case to use the constrained intrinsics does indeed go through this STRICT function. I did want to try to keep the functions unified, but sometimes the result was too ugly to live. | |
3242 | Is that what getUNDEF() does? Give llvm license to put garbage in registers? My assumption is that the code is fine because this function is a copy of WidenVecRes_Convert() with the needed changes for the strict node being chained. If if there's a problem here there's also a problem in that function. | |
lib/IR/Verifier.cpp | ||
4688 | The rewrite moots this point. | |
4699 | I've rewritten this code and with your suggestions it does look much nicer. I think I did put the vector checks are for fptrunc and fpext because they aren't checked earlier. I've also added the checks for the appropriate changes in ScalarSizeInBits. |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
3242 | That is what the undef means. The existing code isn't required to be exception safe so garbage is fine. The constrained intrinsics have to be exception safe. This is why the implementation of WidenVecRes_StrictFP is different than the non-trapping case in WidenVecRes_BinaryCanTrap. I believe FADD/FSUB/FDIV/FMUL are considered non-trapping on most targets. |
I'm seeing a regression in WebAssembly/PR40267.ll that I need to look into. If anyone else is seeing this failure let me know.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
3242 | It looks like this "WidenNumElts % InVTNumElts" block can just be eliminated. The fallback at the end of the function should handle the case without any undefs. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2803 | Would this lowering to EmitStackConvert(...) discard the rounding mode? I'm not familiar with this code, but I *think* it would. E.g. STRICT_FP_ROUND would lower to a truncating store. |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
772 | Have you considered scalarizing the operands with a generic function? Something like ScalarizeVecRes_StrictFPOp(...)? I suspect we'll see reuse of this code.
The different operand and result types for these operations are probably why ScalarizeVecRes_StrictFPOp(...) didn't trip. Just guessing though... |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2803 | Yes it would, assuming EmitStackConvert() works correctly for floating point types. A 'make check' of llvm with the default targets enabled doesn't trigger the call to DAG.getTruncStore(), so I'm not actually sure it works. But none of the other intrinsics handle rounding modes currently, so I think we should leave this to future work. | |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
772 | You know, I thought about it. But trying to generalize it would hide the parallel with ScalarizeVecRes_FP_ROUND(). Right now the connection between the two is obvious. I'm not sure that we'd gain much at this point in time by generalizing and losing that bit of readability. If we do end up needing to generalize this code in the future then we can do it then. |
Address review comments.
Fix regression on WebAssembly accidentally caused by incorrect svn update.
docs/LangRef.rst | ||
---|---|---|
14818 | This should be in a separate patch. | |
include/llvm/IR/Intrinsics.td | ||
611 | This line [ should probably be lined up with the line above. Same with fpext. | |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
1786 | This line doesn't make sense to me. This is replacing the users of result 1 of the load you just created with Chain. But no one has seen this load yet so how can its result 1 have any users? | |
2789 | Shouldn't you be passing the input chain operand of Node into the last argument? Right now it looks like you're using the Chain result from Node itself. But we want to delete Node. | |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
774 | Assert that OpNo is 1. | |
1651–1652 | use a temporary for the OpNo instead of of repeatedly calling N->isStrictFPOpcode() | |
2055 | This comment must have been copy pasted. We aren't operating on a "load" here | |
4178–4179 | Add curly braces to the else and fix the indentation | |
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
7731 | Can this be Node->getValueType(0)? Accessing ValueList directly seems pretty unusual. | |
lib/IR/Verifier.cpp | ||
4644–4746 | Most of this looks like it belongs in a separate patch. This patch shoudl focus on adding the new intrinsics. Gaps in old intrinsics should be separated. | |
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll | ||
3 | Do this as a pre-commit? |
docs/LangRef.rst | ||
---|---|---|
14818 | What should be in a separate patch? The new AddingConstrainedIntrinsics.rst file? |
Address review comments.
I did find that I didn't need a new, special version of ReplaceNode(). One of the existing ones appears to do the job. That meant I could simplify the changes to EmitStackConvert() and avoid chain splicing there. This, in turn, means I only needed to pass in one end of the chain to EmitStackConvert(). I hope the result is more readable.
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | Can this cause the wrong rounding mode to be used? For X86 targets I would guess it will result in an instruction that uses the runtime rounding mode, but I'm not sure we can count on that for all targets. Also, if the value being converted is a constant, might this get folded using the default rounding mode? And if it is a constant and we knew the rounding mode based on an argument to the intrinsic, we might want to fold it (though ideally that would have happened before this). |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | Isn't this an argument for a strict load node types and maybe a strict store type as well? I'd be surprised if we were doing constant folding of a store+load combination, but if we are then either a strict load or store node would disable that folding. Strictly speaking, using the runtime rounding mode would be incorrect if that wasn't what was specified in the rounding mode field of the intrinsic. So even X86 could be wrong. We aren't doing anything with the rounding and exception arguments yet for any of the new intrinsics. So how about I file a bug noting that fact and noting that we need to not forget about this case? Then we can move on and come back later. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | Actually, if the runtime rounding mode doesn't match the rounding argument for non-dynamic cases that's a user error or a bug somewhere upstream in the compiler. The rounding mode argument is supposed to tell us what the rounding mode is at this point in the program. It is not supposed to control the rounding mode. The non-dynamic rounding modes are essentially an "assume" kind of directive. I see that the constant folding case I'm concerned about will actually be covered by one of your test cases, so I think we can safely move forward with this the way you have it. | |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
1651–1652 | I'd prefer to see this as: unsigned OpNo = N->isStrictFPOpcode() ? 1 : 0; | |
2043 | Again, the use of a boolean result as an index is awkward. | |
test/CodeGen/X86/fp-intrinsics.ll | ||
303 | This comment is wrong. | |
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll | ||
3 | Expanding on Craig's comment here. I think he is suggesting that you add the AVX run line and all of the associated new checks as a separate patch before the fpext/fptrrunc patch lands. |
lib/CodeGen/SelectionDAG/SelectionDAG.cpp | ||
---|---|---|
7711 | It might be helpful to have a comment here explaining why STRICT_FP_ROUND isn't unary. That can be done in a separate change if everything else here is ready to commit. It all looks good to me, but I'm hoping Craig can give your widen vector changes one last look to make sure that's doing what he expected. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2793 | This break is unreachable. | |
2807 | This break is unreachable | |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
3225 | I'm not sure this is safe. There's no way of knowing how the input was widened. I don't think you can access more than the original vector width worth of elements. | |
3237 | I don't think this is safe either. | |
4162 | Again I don't think this is safe. We don't what's in widened elements of the input. | |
test/CodeGen/X86/vector-constrained-fp-intrinsics.ll | ||
3895 | I think this is wrong. It's reading 4 elements, but we don't know what is in the 4th element. | |
3966 | This is reading 4 float elements from memory. We should only be reading 2. | |
3992 | This reads 4 floats from memory, but we should only read 3. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 |
I think it's more severe than that. The details are not fresh in my mind, but IIRC EmitStackConvert(...) emits a straight truncate on X86 (well, at least for one case I looked at. Could be wrong too.). We probably shouldn't be calling it for a strict round, and rather go to an instruction that honors rounding mode. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | My impression is that EmitStackConvert() is one of those functions that is only used as a last resort when an ISA lacks a good reg+reg instruction. So this is probably already handled for most cases. If you know of a way to write a strict FP test case that uses it I'd love to see it. And if you have one then we should probably turn it into a library call to honor the rounding mode. | |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
3225 | I believe you are correct. I'm looking at this now. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | @cameron.mcinally , what do you mean by "a straight truncate"? |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | Nope, I made a mistake. EmitStackConvert(...) ends up creating a truncstore, which could end up as an FST. I had mistakenly thought that this rounded to nearest, but it looks like it uses the FPU control word. Assuming that fenv.h sets the FPU control word correctly (I didn't check), it should be fine. |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2787 | Regarding fenv.h, if you use the fesetround() function from fenv.h to set the rounding mode, it will update both MXCSR and FPCW. If you use an intrinsic or inline assembly or whatever to set one of them without the other, you're on your own (and fegetround() will fail if they have different settings). |
Address review comments: Add a comment as requested. Remove bogus optimizations to let the base case work. Update tests.
Why does Verifier::visitIntrinsicCall() not choke when given an intrinsic it doesn't know about?
I added back the lines accidentally lost so the new intrinsics do now get checked.
include/llvm/IR/Intrinsics.td | ||
---|---|---|
691 | We definitely need FCMP. I attempted a patch for this, D54649, but FCMP is Custom lowered on X86. We still need a good way to handle STRICT nodes that need to be Custom lowered. Also FYI that the line below can be removed. We decided not to implement constrained FABS and FCOPYSIGN... unless a problem is found. |
LGTM with that one remaining unused variable fixed.
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
3214 | Is InWidenVT also unused? |
Excellent. Thank you for all the reviews! I will commit this on Monday morning so I can keep an eye on the bots.
This should be in a separate patch.