Page MenuHomePhabricator

[AMDGPU] Add two new intrinsics to control fp_trunc rounding mode
Needs ReviewPublic

Authored by jpages on Sep 27 2021, 1:17 PM.

Details

Summary

Add two new intrinsics to precisely control the rounding mode when converting
from f32 to f16.

Diff Detail

Event Timeline

jpages created this revision.Sep 27 2021, 1:17 PM
jpages requested review of this revision.Sep 27 2021, 1:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2021, 1:17 PM
dstuttard added inline comments.Sep 28 2021, 1:29 AM
llvm/test/CodeGen/AMDGPU/llvm.experimental.fptrunc.round.ll
39

Is there a reason why we couldn't make this a single s_round_mode 0x8 rather than going via s_round_mode 0x0?

foad added a comment.Sep 28 2021, 2:58 AM

I'm not sure what level of approval is needed for new experimental intrinsics. It might be worth emailing llvmdev to explain the requirement. In any case they need to be documented in docs/LangRef.rst.

You're using EmitInstrWithCustomInserter to emit s_round_mode instructions, and then adding a new phase to SIModeRegister to remove them. This seems wrong. You should be able to teach SIModeRegister to insert the required s_round_mode instructions.

llvm/include/llvm/IR/Intrinsics.td
905

Needs a comment.

908

Needs a comment.

llvm/lib/Target/AMDGPU/SIInstructions.td
182–183

Can you implement instruction selection for the intrinsics by adding patterns here, instead of writing C++ code?

jpages updated this revision to Diff 377017.Oct 4 2021, 1:16 PM

Rebased based on previous comments.

The pseudo-instructions are now selected in the pass SIModeRegister.

jpages marked 4 inline comments as done.Oct 4 2021, 1:17 PM
foad added inline comments.Oct 6 2021, 6:00 AM
llvm/lib/Target/AMDGPU/SIModeRegister.cpp
429

You shouldn't need to add a new pass here to insert s_round_mode instructions. The SIModeRegister pass already knows how to do that. You should just be able to add your new PSEUDOs to the switch in getInstructionMode.

jpages updated this revision to Diff 378283.Oct 8 2021, 9:48 AM

Rebased based on Jay's comments.

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

jpages marked an inline comment as done.Oct 8 2021, 9:49 AM
foad added a comment.Oct 11 2021, 1:41 AM

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

Good point. s_round_mode/s_denorm_mode are new in GFX10, so they did not exist when this pass was written. Do you think the pass could be improved to emit s_round_mode/s_denorm_mode instead of s_setreg whenever it only needs to change the rounding/denormal bits of the mode register? That could be a separate patch.

foad added inline comments.Oct 11 2021, 1:47 AM
llvm/lib/Target/AMDGPU/SIModeRegister.cpp
440

If you arrange for the pseudo to have exactly the same operands as the real instruction, then you don't need to build or delete instructions, all you need to do is change the opcode, which you can do with MI.setDesc().

443

I'm not sure this needs another pass over all instructions in the function. Would it be cleaner to do it in phase 1? Maybe even inside getInstructionMode???

arsenm added inline comments.Oct 19 2021, 3:04 PM
llvm/include/llvm/IR/Intrinsics.td
914

Would it be better to have a metadata argument for the rounding mode like the constrained intrinsics? The verifier could disallow the unknown mode type

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

The backend should not directly consume generic intrinsics. These need to be routed through a generic opcode which is subject to normal legalization

jpages added inline comments.Oct 20 2021, 12:12 PM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

What is the reasoning behind this requirement?

I guess I would need something like experimental_fptrunc_round_upward -> amdgpu_experimental_fptrunc_round_upward -> codegen?

llvm/lib/Target/AMDGPU/SIModeRegister.cpp
440

I think this could work. I tried to "replace" my pseudo-instructions in Phase1 but it's not a good idea to add/delete instructions in the middle of an iteration on the basic block.

The consequence of this change is the use of setreg instead of s_round_mode in the codegen. But it's probably better to not reinvent the wheel as this pass is already optimized to not insert too many setreg.

Good point. s_round_mode/s_denorm_mode are new in GFX10, so they did not exist when this pass was written. Do you think the pass could be improved to emit s_round_mode/s_denorm_mode instead of s_setreg whenever it only needs to change the rounding/denormal bits of the mode register? That could be a separate patch.

Sure, that's a good suggestion. I'll try to add this in a different patch.

arsenm added inline comments.Oct 20 2021, 12:38 PM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
4493–4495

No, you would just need G_FPTRUNC_ROUND_UPWARD etc. opcodes. We don't have a way to define legalization rules on intrinsics, so if you need type legalization the backend would have to take care of it manually. e.g. we would have to manually scalarize the vector overloads of these intrinsics. With the opcode you can just define legalization rules like normal

jpages updated this revision to Diff 391139.Wed, Dec 1, 2:56 PM
jpages set the repository for this revision to rG LLVM Github Monorepo.

Rebased.

Added G_FPTRUNC_ROUND_UPWARD/G_FPTRUNC_ROUND_DOWNWARD opcodes between the intrinsics and the pseudo instructions. Added a custom ISD node for the DAG version.

It may be possible to have a simpler implementation for intrinsics -> G_FPTRUNC_ROUND_UPWARD -> pseudos, I'm open to suggestions.

Arranged the pseudos to have exactly the same operands as the v_cvt instruction like Jay suggested.

sepavloff added inline comments.Wed, Dec 1, 9:27 PM
llvm/include/llvm/IR/Intrinsics.td
910

What is the difference between this function and llvm.ceil? Can the latter be used instead of this one?

914

What is the difference between this function and llvm.floor?

There are also functions llvm.nearbyint and llvm.rint that can be used to make rounding with any mode.

sepavloff added inline comments.Wed, Dec 1, 9:32 PM
llvm/include/llvm/IR/Intrinsics.td
914

Oh, I see, this is conversion between floats.

There is llvm.experimental.constrained.fptrunc, which seems to do the same thing?

jpages added inline comments.Thu, Dec 2, 9:35 AM
llvm/include/llvm/IR/Intrinsics.td
914

Yes, this is indeed a conversion between floats.

The idea of this change is to introduce something simpler than constrained intrinsics in a special case.

According to https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics "If any FP operation in a function is constrained then they all must be constrained".

We wanted to do a simpler conversion from f32 -> f16 with a special rounding mode, but without constraining the rest of the function after this operation. Introducing an Intrinsic that would be lowered to 3 instructions seemed to be the best solution.

These 3 instructions are:

  • Setting a special rounding mode
  • Conversion f32 -> f16
  • Restoring the rounding mode to the default value
sepavloff added inline comments.Thu, Dec 2, 7:02 PM
llvm/include/llvm/IR/Intrinsics.td
914

This would be a useful function, which solves a widespread task. On some platforms, like RISC-V, it even can be lowered into a single instruction.

As this is common interface, I would propose you to make rounding mode an argument. It would allows a target to implement the conversion with any supported mode. Besides, it would allow future extensions without changing the interface, for example, to allow other control modes:

lvm.fptrunc.round(%a, !"round.towardzero,denorm.ftz")

What prevents from making it a regular intrinsic, not experimental?