
[RISCV] Set dependency on floating point CSRs, 1/3
Needs ReviewPublic

Authored by sepavloff on Jan 6 2021, 2:31 AM.

Details

Summary

There are dependencies between floating point instructions that are
missing from the target description. They involve special registers that
keep exception flags and rounding mode. Most FP instructions can set
accrued exception flags, so they are implicit definitions of 'fflags'.
Instructions that use dynamic rounding mode depend on the content of
'frm', so they represent implicit uses of this register. These dependencies
impose restrictions on the ordering of FP instructions and must be
provided to the compiler.

In general there can be 4 variants of an instruction depending on
whether it depends on frm and whether the changes of fflags should
be ignored. The two most important of them are:

  • defines 'fflags', depends on 'frm';
  • does not depend on frm, changes of fflags should be ignored.

The first one is the general case, so it must always be supported.
The second corresponds to the default FP environment, which is the most
widespread case. The other two might be beneficial in some situations, but
are expected to be rarer.

This change defines these two variants for relevant FP instructions. It
is split into several parts to facilitate review. This part implements
the dependency for instructions with three input registers.
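
The two variants described above can be modelled outside of TableGen. The following is a minimal Python sketch, not the patch itself: the `Instr` class, the `make_variants` helper and the `_DEFAULT` suffix are hypothetical names chosen for illustration, and only the register names 'frm' and 'fflags' come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Toy instruction description with implicit register operands."""
    name: str
    implicit_uses: frozenset = frozenset()
    implicit_defs: frozenset = frozenset()


def make_variants(base):
    """Return the (generic, default) variants of one FP instruction.

    generic: reads 'frm' and writes 'fflags' (the general case).
    default: no dependency on 'frm', changes of 'fflags' are ignored.
    """
    generic = Instr(base, frozenset({"frm"}), frozenset({"fflags"}))
    default = Instr(base + "_DEFAULT")  # hypothetical naming scheme
    return generic, default
```

In this model the default variant carries no implicit operands at all, which is what lets it be ordered freely relative to CSR accesses.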

Diff Detail

Event Timeline

sepavloff created this revision.Jan 6 2021, 2:31 AM
sepavloff requested review of this revision.Jan 6 2021, 2:31 AM

I still don't understand why the existence of static rounding modes in the ISA requires that we have to use them for the default environment. X86 doesn't have static rounding mode prior to AVX512 so uses dynamic in the default mode.

Do any other targets create 3 variants of instructions like this? X86 definitely doesn't. What makes RISC-V special that it needs to go this extreme?

I still don't understand why the existence of static rounding modes in the ISA requires that we have to use them for the default environment. X86 doesn't have static rounding mode prior to AVX512 so uses dynamic in the default mode.

It is more convenient. Instructions with static rounding mode do not depend on frm so they may be scheduled more freely. Besides, a function that contains only FP instructions with static rounding may be safely called from a non-default FP environment. Targets without static rounding mode don't have this possibility.

Do any other targets create 3 variants of instructions like this? X86 definitely doesn't. What makes RISC-V special that it needs to go this extreme?

It must be a target that supports both static and dynamic rounding modes. Are there other targets in the official LLVM repository that support both of them?

If there’s no write to frm then there shouldn’t be a scheduling issue. Can you demonstrate such an issue on a target without static rounding mode?

For the exception side I thought we already solved the scheduling issue with the mayRaiseFPException bit that is treated differently than using a register def of the fflags register.

If there’s no write to frm then there shouldn’t be a scheduling issue.

Sure. Such an issue arises when there is a write to frm. Consider the following pseudo code:

float a = ...
for (int i = ...) {
  fesetround(FE_TOWARDZERO); // csrwi frm, 1
  ...
  x[i] += floor(a); // fcvt ..., rdn
}

floor(a) is loop invariant and could be hoisted out of the loop. This is possible because fcvt uses static rounding. However, if fcvt used dynamic rounding, it would depend on frm, which is changed above, so it could not be moved out of the loop.
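
The hoisting argument can be sketched as a simple invariance check. This is an illustrative Python model under stated assumptions, not how LLVM's LICM is implemented: `is_hoistable` is a hypothetical helper, and the register sets are written by hand from the pseudo code above.

```python
def is_hoistable(instr_uses, loop_defs):
    """An instruction is loop invariant only if nothing it reads
    is written anywhere inside the loop."""
    return not (set(instr_uses) & set(loop_defs))


# Everything written inside the loop body of the pseudo code.
loop_defs = {"frm", "x[i]"}      # fesetround writes frm; the update writes x[i]

# Registers read by the two possible encodings of the conversion.
fcvt_static = {"a"}              # static rdn: reads only its operand
fcvt_dynamic = {"a", "frm"}      # dynamic rounding: also reads frm
```

With these sets, `is_hoistable(fcvt_static, loop_defs)` is true while `is_hoistable(fcvt_dynamic, loop_defs)` is false, which mirrors the claim that only the static form can leave the loop.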

Why wouldn't that have been hoisted out of the loop by IR LICM? Machine LICM is primarily intended to move stack reloads and constant pool loads. It only runs on the outermost loop with a preheader.

This is another example:

%1:fpr32 = …
%2:fpr32 = …
...
csrw frm, rdn
…
%3:fpr32 = FADD_S killed %1:fpr32, killed %2:fpr32, rne

In this code the scheduler can move the FADD_S instruction upward, so the live ranges of %1 and %2 become shorter and register pressure decreases. If the FADD_S instruction implicitly depends on frm, the scheduler cannot move it above csrw, so this optimization is not possible.

sepavloff updated this revision to Diff 326636.Feb 26 2021, 2:35 AM

Reduced number of instruction variants from 3 to 2 (generic and default)

sepavloff edited the summary of this revision.Feb 26 2021, 2:37 AM
lenary removed a subscriber: lenary.Feb 26 2021, 2:59 AM
asb added a comment.Mar 4 2021, 8:27 AM

We discussed this briefly in the RISC-V call as I noted this patchset has been sat open for some time. One thing that might be helpful is whether you could say a little bit more about the goal for this patchset. Once this lands, what's the next step? Is there some relevant RFC, or equivalent changes being made to other in-tree architectures?

We're also still unclear about the advantage of changing codegen to default to a static rounding mode (which might be a surprising change, as all software compiled to date on both GCC and LLVM has used the dynamic rounding mode by default).

In D94163#2603738, @asb wrote:

We discussed this briefly in the RISC-V call as I noted this patchset has been sat open for some time. One thing that might be helpful is whether you could say a little bit more about the goal for this patchset.

Most floating point instructions set accrued exception bits in fflags register. If an instruction is specified with dynamic rounding mode, it also depends on the content of frm register. So in the following code:

csrw   frm, a1
fadd.d ft2, ft2, ft3

changing the order of the instructions is not allowed, because fadd.d depends on the content of frm, which is changed by the previous instruction. Similarly, the code:

fadd.d ft2, ft2, ft3
csrrs t0, fflags, zero

does not allow the order of the instructions to be changed, as csrrs reads the content of fflags, which is set by the first instruction.

Currently nothing prevents the compiler from changing the order of instructions in these examples. The existing instruction definitions are unable to express these dependencies. This makes it impossible to write programs that use a non-default floating point environment, for example, a dynamic rounding mode.

The goal of this patchset is to establish means to express such dependencies.
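
The two ordering constraints above amount to ordinary read/write hazards on frm and fflags. Below is a minimal Python sketch of that check, assuming a toy representation in which each instruction is a dict of `uses` and `defs`; `may_swap` and the four instruction descriptions are hypothetical and only approximate the real operands.

```python
def may_swap(first, second):
    """Two adjacent instructions may be swapped only if there is
    no RAW, WAR or WAW hazard between them."""
    raw = first["defs"] & second["uses"]   # second reads what first writes
    war = first["uses"] & second["defs"]   # second overwrites what first reads
    waw = first["defs"] & second["defs"]   # both write the same register
    return not (raw or war or waw)


# CSR accesses from the two examples.
csrw_frm = {"uses": {"a1"}, "defs": {"frm"}}
csrrs_fflags = {"uses": {"fflags"}, "defs": {"t0"}}

# fadd.d with the implicit operands this patchset wants to express ...
fadd_tracked = {"uses": {"ft2", "ft3", "frm"}, "defs": {"ft2", "fflags"}}
# ... and as described today, without them.
fadd_untracked = {"uses": {"ft2", "ft3"}, "defs": {"ft2"}}
```

With the untracked description both pairs look freely reorderable, which is the problem being reported; with the implicit operands added, `may_swap` rejects both reorderings.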

Once this lands, what's the next step?

This is the first and necessary step toward a full-fledged implementation of floating point support. In particular it would allow progress in D91242 and D90854. Actually this work was undertaken because the lack of dependencies prevents the implementation of llvm.set.rounding (https://reviews.llvm.org/D91242#2400476). The next steps, of course, include support of constrained intrinsics for RISC-V.

Is there some relevant RFC,

If an RFC would facilitate the review of this patchset, I will prepare one.

or equivalent changes being made to other in-tree architectures?

Targets that support full-fledged floating point operations add implicit uses and definitions of FP state and control register(s). For example:
• X86: https://github.com/llvm/llvm-project/blob/bc172e532a89754d47fef1306064a26a4dc0a76b/llvm/lib/Target/X86/X86InstrFPStack.td#L728 (see let Defs = [FPSW] and let Uses = [FPCW]),
• PowerPC: https://github.com/llvm/llvm-project/blob/e7361c8eccb7663146096622549dc03240414157/llvm/lib/Target/PowerPC/PPCInstrInfo.td#L3169 (see Uses = [RM]),
• SystemZ: https://github.com/llvm/llvm-project/blob/9e28b89827a3be4ab602b40c263839665af06b4a/llvm/lib/Target/SystemZ/SystemZInstrFP.td#L434 (see let Uses = [FPC])

We're also still unclear about the advantage of changing codegen to default to a static rounding mode (which might be a surprising change, as all software compiled to date on both GCC and LLVM has used the dynamic rounding mode by default).

Assembler instructions without an explicit rounding mode specification get dynamic rounding mode, as they do now. Lowering of FP operations like fadd uses the static rounding mode RNE, because these operations assume the default floating point environment (https://llvm.org/docs/LangRef.html#floating-point-environment). Using a static rounding mode has some advantages over assuming that frm has a particular value. Code that requires the default rounding mode does not need to set frm in a program where some pieces use a non-default rounding mode. Such code works as designed even if it is called from a region where another rounding mode is set. This simplifies the implementation of things like #pragma STDC FENV_ROUND and makes programs more robust.
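
The difference between the two encodings can be illustrated with a small model of a float-to-integer conversion. This is only a sketch under simplifying assumptions: `round_to_int` and `fcvt` are hypothetical helpers, rounding is done on Python floats rather than on real hardware, and only four of the RISC-V rounding modes are modelled.

```python
import math


def round_to_int(x, mode):
    """Round x to an integer under the named rounding mode."""
    if mode == "rne":
        return round(x)        # Python's round() rounds halves to even
    if mode == "rtz":
        return math.trunc(x)   # toward zero
    if mode == "rdn":
        return math.floor(x)   # toward negative infinity
    if mode == "rup":
        return math.ceil(x)    # toward positive infinity
    raise ValueError(mode)


def fcvt(x, rm, frm):
    """rm == 'dyn' defers to the current frm value; any other rm is a
    static rounding mode and ignores frm entirely."""
    return round_to_int(x, frm if rm == "dyn" else rm)
```

In this model `fcvt(2.5, "rne", frm)` gives the same result for every value of `frm`, while `fcvt(2.5, "dyn", frm)` follows whatever mode the caller left in `frm`.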

jrtc27 added a comment.Mar 5 2021, 4:52 AM

That's going to break huge piles of C/C++ code that sets the (dynamic) rounding mode and expects it to have an effect on subsequent computations. I do not think that is a good idea.

You are right, it might be dangerous. As long as RISC-V does not support constrained intrinsics, it would be safer to use instructions with dynamic rounding mode.

sepavloff updated this revision to Diff 334391.Wed, Mar 31, 3:03 AM

Updated patch for alternative CSR solution