This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add a new intrinsic to control fp_trunc rounding mode
ClosedPublic

Authored by jpages on Sep 27 2021, 1:17 PM.

Details

Summary

Add a new intrinsic to precisely control the rounding mode when converting
from f32 to f16.

Diff Detail

Event Timeline

jpages created this revision.Sep 27 2021, 1:17 PM
jpages requested review of this revision.Sep 27 2021, 1:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2021, 1:17 PM
dstuttard added inline comments.Sep 28 2021, 1:29 AM
llvm/test/CodeGen/AMDGPU/llvm.experimental.fptrunc.round.ll
38 ↗(On Diff #375382)

Is there a reason why we couldn't make this a single s_round_mode 0x8 rather than going via s_round_mode 0x0?

foad added a comment.Sep 28 2021, 2:58 AM

I'm not sure what level of approval is needed for new experimental intrinsics. It might be worth emailing llvmdev to explain the requirement. In any case they need to be documented in docs/LangRef.rst.

You're using EmitInstrWithCustomInserter to emit s_round_mode instructions, and then adding a new phase to SIModeRegister to remove them. This seems wrong. You should be able to teach SIModeRegister to insert the required s_round_mode instructions.

llvm/include/llvm/IR/Intrinsics.td
905

Needs a comment.

908

Needs a comment.

llvm/lib/Target/AMDGPU/SIInstructions.td
182–183

Can you implement instruction selection for the intrinsics by adding patterns here, instead of writing C++ code?

jpages updated this revision to Diff 377017.Oct 4 2021, 1:16 PM

Rebased based on previous comments.

The pseudo-instructions are now selected in the pass SIModeRegister.

jpages marked 4 inline comments as done.Oct 4 2021, 1:17 PM
foad added inline comments.Oct 6 2021, 6:00 AM
llvm/lib/Target/AMDGPU/SIModeRegister.cpp
428

You shouldn't need to add a new pass here to insert s_round_mode instructions. The SIModeRegister pass already knows how to do that. You should just be able to add your new PSEUDOs to the switch in getInstructionMode.

jpages updated this revision to Diff 378283.Oct 8 2021, 9:48 AM

Rebased based on Jay's comments.

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

jpages marked an inline comment as done.Oct 8 2021, 9:49 AM
foad added a comment.Oct 11 2021, 1:41 AM

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

Good point. s_round_mode/s_denorm_mode are new in GFX10, so they did not exist when this pass was written. Do you think the pass could be improved to emit s_round_mode/s_denorm_mode instead of s_setreg whenever it only needs to change the rounding/denormal bits of the mode register? That could be a separate patch.

foad added inline comments.Oct 11 2021, 1:47 AM
llvm/lib/Target/AMDGPU/SIModeRegister.cpp
439

If you arrange for the pseudo to have exactly the same operands as the real instruction, then you don't need to build or delete instructions, all you need to do is change the opcode, which you can do with MI.setDesc().

442

I'm not sure this needs another pass over all instructions in the function. Would it be cleaner to do it in phase 1? Maybe even inside getInstructionMode???

arsenm added inline comments.Oct 19 2021, 3:04 PM
llvm/include/llvm/IR/Intrinsics.td
914

Would it be better to have a metadata argument for the rounding mode like the constrained intrinsics? The verifier could disallow the unknown mode type

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

The backend should not directly consume generic intrinsics. These need to be routed through a generic opcode which is subject to normal legalization

jpages added inline comments.Oct 20 2021, 12:12 PM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

What is the reasoning behind this requirement?

I guess I would need something like experimental_fptrunc_round_upward -> amdgpu_experimental_fptrunc_round_upward -> codegen?

llvm/lib/Target/AMDGPU/SIModeRegister.cpp
439

I think this could work. I tried to "replace" my pseudo-instructions in Phase1 but it's not a good idea to add/delete instructions in the middle of an iteration on the basic block.

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

Good point. s_round_mode/s_denorm_mode are new in GFX10, so they did not exist when this pass was written. Do you think the pass could be improved to emit s_round_mode/s_denorm_mode instead of s_setreg whenever it only needs to change the rounding/denormal bits of the mode register? That could be a separate patch.

Sure, that's a good suggestion. I'll try to add this in a different patch.

arsenm added inline comments.Oct 20 2021, 12:38 PM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

No, you would just need G_FPTRUNC_ROUND_UPWARD etc. opcodes. We don't have a way to define legalization rules on intrinsics, so if you need type legalization the backend would have to take care of it manually. e.g. we would have to manually scalarize the vector overloads of these intrinsics. With the opcode you can just define legalization rules like normal

jpages updated this revision to Diff 391139.Dec 1 2021, 2:56 PM
jpages set the repository for this revision to rG LLVM Github Monorepo.

Rebased.

Added G_FPTRUNC_ROUND_UPWARD/G_FPTRUNC_ROUND_DOWNWARD opcodes between the intrinsics and the pseudo instructions. Added a custom ISD node for the DAG version.

It may be possible to have a simpler implementation for intrinsics -> G_FPTRUNC_ROUND_UPWARD -> pseudos, I'm open to suggestions.

Arranged the pseudos to have exactly the same operands as the v_cvt instruction like Jay suggested.

sepavloff added inline comments.Dec 1 2021, 9:27 PM
llvm/include/llvm/IR/Intrinsics.td
910

What is the difference between this function and llvm.ceil? Can the latter be used instead of this one?

914

What is the difference between this function and llvm.floor?

There are also functions llvm.nearbyint and llvm.rint that can be used to make rounding with any mode.

sepavloff added inline comments.Dec 1 2021, 9:32 PM
llvm/include/llvm/IR/Intrinsics.td
914

Oh, I see, this is conversion between floats.

There is llvm.experimental.constrained.fptrunc, which seems to do the same thing?

jpages added inline comments.Dec 2 2021, 9:35 AM
llvm/include/llvm/IR/Intrinsics.td
914

Yes, this is indeed a conversion between floats.

The idea of this change is to introduce something simpler than constrained intrinsics in a special case.

According to https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics "If any FP operation in a function is constrained then they all must be constrained".

We wanted to do a simpler conversion from f32 -> f16 with a special rounding mode, but without constraining the rest of the function after this operation. Introducing an Intrinsic that would be lowered to 3 instructions seemed to be the best solution.

These 3 instructions are:

  • Setting a special rounding mode
  • Conversion f32 -> f16
  • Restoring the rounding mode to the default value
sepavloff added inline comments.Dec 2 2021, 7:02 PM
llvm/include/llvm/IR/Intrinsics.td
914

This would be a useful function, which solves a widespread task. On some platforms, like RISC-V, it even can be lowered into a single instruction.

As this is common interface, I would propose you to make rounding mode an argument. It would allows a target to implement the conversion with any supported mode. Besides, it would allow future extensions without changing the interface, for example, to allow other control modes:

lvm.fptrunc.round(%a, !"round.towardzero,denorm.ftz")

What prevents from making it a regular intrinsic, not experimental?

jpages updated this revision to Diff 399125.Jan 11 2022, 4:44 PM
jpages edited the summary of this revision. (Show Details)

Rebased.

I changed the patch to have only one intrinsic with a rounding mode parameter as suggested. This required many modifications in the frontend parts of LLVM.

Thanks for the comments!

craig.topper added inline comments.
llvm/include/llvm/IR/Intrinsics.td
907

Align the brackets with the previous line.

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2261

Use cast instead of dyn_cast if it can't fail. Or check that the dyn_cast didn't return null if it can fail.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6337

Why would this intrinsic ever have a chain?

6349

Use cast instead of dyn_cast if it can't fail. Or check that the dyn_cast didn't return null if it can fail.

6384

The code has optionally added a chain, but INTRINSIC_W_CHAIN must always have a chain. Whether there is a chain or not is a property of the ISD opcode.

craig.topper added inline comments.Jan 11 2022, 5:11 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6337

Err. I guess the intrinsic attributes are defined to always have a Chain. Should it be IntrNoMem instead?

foad added inline comments.Jan 12 2022, 1:19 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6337

IIUC, if the rounding mode is allowed to be "dynamic" then the intrinsic needs to be IntrInaccessibleMemOnly (just like the constrained fp intrinsics). If we don't allow "dynamic" then it could be IntrNoMem.

Personally I have no interest in supporting "dynamic" so I would vote for disallowing it and making the intrinsic IntrNoMem.

jpages updated this revision to Diff 399768.Jan 13 2022, 1:40 PM

Rebased.

The intrinsic has the IntrNoMem attribute, and is represented by INTRINSIC_WO_CHAIN in the DAG.

jpages marked 4 inline comments as done.Jan 13 2022, 1:42 PM
jpages added inline comments.Jan 13 2022, 1:46 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6337

Yes the original version was defined to have a chain, I changed it with IntrNoMem/INTRINSIC_WO_CHAIN as suggested. Thank you for the suggestions.

As a consequence this code is a bit simpler, and the lowering in the backend was modified to reflect that .

foad added a comment.Jan 14 2022, 1:51 AM

It seems like people are mostly happy with the design now, so I am being a bit more picky with my review comments!

llvm/docs/LangRef.rst
23858

Add "... with a specified rounding mode". Apart from that I think most of the Overview/Arguments/Semantics text could be copied verbatim from the definition of the fptrunc-to instruction?

23868

Need to disallow "dynamic" if that's what we decided.

23874

Probably need to add the text from fptrunc-to: "This instruction is assumed to execute in the default floating-point environment" (because of the stuff about exceptions) and then say *except* for the rounding mode.

llvm/include/llvm/IR/Intrinsics.td
905

"Truncate"

908

Add IntrWillReturn.

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2257

VRegs isn't used for anything so remove it.

2269

We know it's not void, and has exactly one result, so simplify this.

2273

We know it does not access memory.

llvm/lib/IR/Verifier.cpp
4748

MD could be nullptr here, so you need to handle that.

4750

Check that the rounding mode is not "dynamic"?

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
698 ↗(On Diff #399125)

Can this be done with patterns in the tablegen files instead of C++ code? Then they should work for both SelectionDAG and GlobalISel.

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3252 ↗(On Diff #399125)

Can this be done with patterns in the tablegen files instead of C++ code?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4939

Should this be "return false" to indicate a legalization failure in the normal way?

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4600

sgpr seems wrong. This is specifying the mapping for the result value and the input value (not the rounding mode which is no longer an operand). I think this should just use the same mapping as G_FPTRUNC (line 3640).

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6844

I'm not quite sure but I think you should return SDValue() here to indicate a lowering/legalization failure?

llvm/lib/Target/AMDGPU/SIISelLowering.h
134 ↗(On Diff #399768)

No longer needed.

I think I'd like to see a target independent ISD opcode added for this. We don't use ISD::INTRINSIC_WO_CHAIN for target independent intrinsics.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6356

This is always false. And if it isn't you would only push one argument.

6361

I'd probably just use TLI.getPointerTy instead of MVT::i8.

6377

I know this was taken from visitTargetIntrinsic, but I don't really understand it.

6381

I don't think this is needed. AssertZExt isn't useful for an FP result.

6383

I don't think this alignment stuff is needed. The return type is FP so it probably doesn't do anything.

jpages updated this revision to Diff 404732.Jan 31 2022, 2:21 PM
jpages retitled this revision from [AMDGPU] Add two new intrinsics to control fp_trunc rounding mode to [AMDGPU] Add a new intrinsic to control fp_trunc rounding mode.
jpages marked 12 inline comments as done.EditedJan 31 2022, 2:29 PM

Rebased, as suggested I added a target independent ISD opcode.

Thanks again for all the comments!

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6377

I removed this line, I'm not sure of what it was doing either.

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3252 ↗(On Diff #399125)

Maybe it's possible, but I could not find a way to match both the G_FPTRUNC_ROUND_UPWARD and the selection DAG version.

So I let this one in C++.

llvm/lib/Target/AMDGPU/SIInstructions.td
182–183

Like I said previously, I tried to use pure tablegen for both SelectionDag and global-isel and I could not find a way to do it for both.

foad added inline comments.Feb 1 2022, 3:13 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3252 ↗(On Diff #399125)

You can get some clues about GlobalISel pattern problems by running tablegen with -warn-on-skipped-patterns like this:

$ llvm-tblgen -gen-global-isel -I lib/Target/AMDGPU -I build/include -I include -I lib/Target lib/Target/AMDGPU/AMDGPUGISel.td -warn-on-skipped-patterns
...
lib/Target/AMDGPU/SIInstructions.td:186:1: warning: Skipped pattern: Pattern operator lacks an equivalent Instruction (AMDGPUISD::FPTRUNC_ROUND_UPWARD)
def FPTRUNC_UPWARD_PSEUDO : VPseudoInstSI <(outs VGPR_32:$vdst),
^
...
lib/Target/AMDGPU/SIInstructions.td:190:1: warning: Skipped pattern: Pattern operator lacks an equivalent Instruction (AMDGPUISD::FPTRUNC_ROUND_DOWNWARD)
def FPTRUNC_DOWNWARD_PSEUDO : VPseudoInstSI <(outs VGPR_32:$vdst),
^

This means you need to add some GINodeEquiv lines in AMDGPUGISel.td.

llvm/lib/Target/AMDGPU/SIInstructions.td
188

Just end the line with ; if there is nothing to go inside the {}.

foad added a comment.Feb 1 2022, 4:02 AM

Here's what I came up with to get the globalisel pattern matching working, plus a few other suggested tweaks: https://reviews.llvm.org/differential/diff/404879/

jpages updated this revision to Diff 404944.Feb 1 2022, 8:26 AM

Rebased following Jay's comments.

jpages marked 2 inline comments as done.Feb 1 2022, 8:33 AM
jpages added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3252 ↗(On Diff #399125)

Thanks again, I didn't know about this command

jpages updated this revision to Diff 405819.Feb 3 2022, 3:32 PM
jpages marked an inline comment as done.

Updated a failing test on AArch64.

craig.topper added inline comments.Feb 3 2022, 3:47 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6355

You probably don't need ComputeValueVTs. There shouldn't be any structs or arrays here that need to broken down. I think you can do

EVT VT = TLI.getValueType(I.getType)

And then use VT in place of VTs

jpages updated this revision to Diff 405846.Feb 3 2022, 6:11 PM
jpages marked 3 inline comments as done.
jpages added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6355

Indeed it works, thanks!

foad added a comment.Feb 4 2022, 1:41 AM

This is looking pretty good, just another round of minor comments.

llvm/docs/LangRef.rst
23870

This seems to be repeating the "must be larger" text from a few lines above.

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2270

return true.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6337

You don't really need a SmallVector here, since you know you have exactly two operands, you could just pass them as two separate arguments to the DAG.getNode call. But it's up to you.

6341

Shouldn't this be getArgOperand(1)? How does it work?

6345

This isn't used.

6346

TLI is already available in this function.

6359

All intrinsics returning floating point type are classed as FPMathOperators, so you can do this part unconditionally.

6364

Use sdl here too.

llvm/lib/IR/Verifier.cpp
4748

I am not an expert on verifying metadata but I think you probably need to check that it is an MDString here, otherwise the cast<MDString> below could fail? And really we ought to have some tests in test/Verifier/llvm.fptrunc.round.ll.

4752

I am not sure but I think RoundMode != RoundingMode::Dynamic needs to be RoundMode.getValue() != ... (or *RoundMode != ...).

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4662

Do you also need to check that the operand is MVT::f32? Maybe add a test where the operand is f64, and check that it fails to legalize?

jpages updated this revision to Diff 406482.Feb 7 2022, 8:50 AM
jpages marked an inline comment as done.

Rebased following Jay's comments.

Added some tests for the verifier/legalizer.

jpages marked 12 inline comments as done.Feb 7 2022, 9:00 AM
foad added a comment.Feb 8 2022, 2:42 AM

I think this looks good now. Are you happy with it @craig.topper @sepavloff ?

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2258

Should probably use getArgOperand here, though it doesn't make any difference except for bounds checking the index that you pass in.

llvm/lib/IR/Verifier.cpp
4750

isa is a bit more natural than dyn_cast here.

llvm/test/CodeGen/AMDGPU/fail.llvm.fptrunc.round.ll
2

Does this test work in a Release build? If not, you can add a "REQUIRES: asserts" line just after the RUN lines.

sepavloff accepted this revision.Feb 8 2022, 8:24 AM

It looks good to me.

I had a look at generic changes only, someone with expertise in AMDGPU should also approve this patch.

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2268

Flags could be set in the call to MIRBuilder.buildInstr.

This revision is now accepted and ready to land.Feb 8 2022, 8:24 AM
craig.topper added inline comments.Feb 8 2022, 9:06 AM
llvm/docs/LangRef.rst
23879

Should we document that this isn't supported on all targets, and that a particular target may not support all rounding modes?

jpages updated this revision to Diff 406953.Feb 8 2022, 1:35 PM
jpages marked 5 inline comments as done.Feb 8 2022, 1:42 PM
jpages added inline comments.
llvm/docs/LangRef.rst
23879

Good idea, thanks! I added a comment about that

llvm/test/CodeGen/AMDGPU/fail.llvm.fptrunc.round.ll
2

I just checked and it's working as well in Release without that.

foad accepted this revision.Feb 9 2022, 4:35 AM
foad added inline comments.
llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
2267

You can also do this on the end of the buildInstr line if you prefer, no need for MIB at all.

jpages updated this revision to Diff 407722.Feb 10 2022, 4:30 PM
jpages marked 2 inline comments as done.
jpages updated this revision to Diff 407733.Feb 10 2022, 5:33 PM
foad accepted this revision.Feb 11 2022, 12:11 AM
This revision was landed with ongoing or failed builds.Feb 11 2022, 9:09 AM
This revision was automatically updated to reflect the committed changes.