Page MenuHomePhabricator

[RISCV] Add IR intrinsics for reading/write vxrm.
Needs ReviewPublic

Authored by craig.topper on Nov 8 2021, 3:43 PM.

Details

Summary

Clang builtins for these will come later along with changes to
riscv_vector.h

Co-authored-by: ShihPo Hung <shihpo.hung@sifive.com>

Diff Detail

Unit TestsFailed

TimeTest
100 msx64 debian > LLVM.tools/llvm-tli-checker::ps4-tli-check.s
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llvm-mc --triple=x86_64-scei-ps4 --filetype=obj /var/lib/buildkite-agent/builds/llvm-project/llvm/test/tools/llvm-tli-checker/ps4-tli-check.s -o /var/lib/buildkite-agent/builds/llvm-project/build/test/tools/llvm-tli-checker/Output/ps4-tli-check.s.tmp.o

Event Timeline

craig.topper created this revision.Nov 8 2021, 3:43 PM
craig.topper requested review of this revision.Nov 8 2021, 3:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2021, 3:43 PM
Herald added a subscriber: MaskRay. · View Herald Transcript
rogfer01 added inline comments.Nov 18 2021, 12:43 AM
llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338

Next to the FRM above this makes me curious. Either we don't need this or we forgot it for the FRM ones?

It is a bit surprising we defined ReadSysReg / WriteSysReg / WriteSysRegImm as not having side effects. I wonder what is the reason for that.

I'm sure I'm missing something here.

craig.topper added inline comments.Nov 18 2021, 9:28 AM
llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338

FRM is incomplete. None of the FP instructions are annotated to with Uses/Defs of FRM.

I think we found issues with instruction scheduling for VXRM on our internal branch if we didn't have hasSideEffects here. I'll see if I can find the discussion and if there was a test case added. I suspect FRM also needs it, but hasn't been tested as much.

craig.topper added inline comments.Nov 18 2021, 9:45 AM
llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338

Ok I reviewed the discussion. I think what happens is InstrEmitter.cpp sets the physical register def as Dead because SelectionDAG doesn't see the VXRM instruction glued to any instructions. This allowed MachineCSE to remove or CSE something in a way it shouldn't have.

Given past experiences I've had in the past with registers like this, I'd recommend not implementing these intrinsics in their current state. Giving the user access to directly mess with registers like this makes IR optimizations harder because we have to model the intrinsics that depend on the register as reading/writing memory. And there are bad interactions between the register state and calls introduced by the compiler: if the compiler introduces a call, that call might not preserve the state.

Instead, I'd suggest modifying the intrinsics for the instructions that depend on this register: make them take the desired rounding mode as an argument. You can then lower the intrinsics in the backend: for each instruction that needs a rounding mode, set the rounding mode before it. This pass that does this lowering can coalesce/hoist the register modifications when appropriate.

craig.topper added a comment.EditedDec 9 2021, 12:16 PM

Given past experiences I've had in the past with registers like this, I'd recommend not implementing these intrinsics in their current state. Giving the user access to directly mess with registers like this makes IR optimizations harder because we have to model the intrinsics that depend on the register as reading/writing memory. And there are bad interactions between the register state and calls introduced by the compiler: if the compiler introduces a call, that call might not preserve the state.

Instead, I'd suggest modifying the intrinsics for the instructions that depend on this register: make them take the desired rounding mode as an argument. You can then lower the intrinsics in the backend: for each instruction that needs a rounding mode, set the rounding mode before it. This pass that does this lowering can coalesce/hoist the register modifications when appropriate.

Thanks Eli. Do you recommend lowering in SelectionDAG or as a separate pass? Should we restore the original value after the call? Are there examples in tree?

Maybe I should also mention that my immediate goal was to enable Halide to implement halving_add on RISCV vectors. They were going to use these intrinsics and existing RVV intrinsics that are marked as having side effects. I had wondered if it would make sense to have generic IR intrinsics for halving add. I could lower that to a vxrm change plus the vaadd instruction.

If you want an example of something broken because we didn't do it correctly, look at the ARM intrinsics involving the saturation bit. For an example of something that works... maybe the x86 "DF" flag?

Probably best to do the lowering after isel: we don't have any infrastructure for hoisting/coalescing the instructions that modify the register (unless you want to try to make the register allocator handle it).

Maybe look at X86VZeroUpper.cpp for inspiration; not exactly the same thing, but a similar algorithm should work.

How you deal with calls depends on the calling convention you want to use. There are basically three possibilities:

  1. We require that the register is zero (or some other fixed value) on entry to/return from a call.
  2. The register is volatile across calls.
  3. The register is preserved by calls.

It looks like the current ABI documentation says it has "thread storage duration", but that's not really a great idea; we don't want to expose this in the first place. Hopefully this isn't set in stone? If necessary, we could use some hack like preserving the incoming value, but assuming outgoing calls clobber it, I guess.

If you're using a state machine like X86VZeroUpper, it should be easy to handle in any case.

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

Maybe I should also mention that my immediate goal was to enable Halide to implement halving_add on RISCV vectors. They were going to use these intrinsics and existing RVV intrinsics that are marked as having side effects.

This is part of why we want the compiler to control the rounding/saturation state, not the user. You don't want halving_add to have side-effects. :)

I had wondered if it would make sense to have generic IR intrinsics for halving add. I could lower that to a vxrm change plus the vaadd instruction.

You mean target-independent? This is orthogonal to the question of what the riscv intrinsics should look like. But it would make sense; it's a common operation in vector instruction sets.

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

My reading is that "thread storage duration" is supposed to mean that if the user sets it in one function, and retrieves it later in some other function, it should have the same value. This matches the corresponding language for the floating-point state register. That implies the sort of mess we have with floating-point rounding modes and error flags; that's not something we want to emulate in other contexts.

If we're treating it as a normal register, I would expect language like "this register is preserved across calls" or "this register isn't preserved across calls" or something like that.

jrtc27 added a comment.Dec 9 2021, 1:27 PM

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

My reading is that "thread storage duration" is supposed to mean that if the user sets it in one function, and retrieves it later in some other function, it should have the same value. This matches the corresponding language for the floating-point state register. That implies the sort of mess we have with floating-point rounding modes and error flags; that's not something we want to emulate in other contexts.

If we're treating it as a normal register, I would expect language like "this register is preserved across calls" or "this register isn't preserved across calls" or something like that.

I see; I thought we did say it was call-clobbered, but we don't, because the equivalent fcsr isn't. The intent behind the current wording was to exactly follow what's done for floating-point, for minimum surprise and disruption; even if the floating-point environment is a bad idea for efficient implementation, I'm not sure doing something totally different for vectors is a good idea. Bear in mind that, for floating-point vector operations, the rounding mode is still what's used for scalar floats, this only affects fixed-point operations.

I think at the IR level, we need the intrinsic variants that have a rounding mode argument if we want to allow the compiler to ever generate the relevant instructions for autovectorization etc. The autovectorizer can't use vgetvrm/vsetvrm correctly and efficiently; it can't tell, for example, where the backend will insert runtime calls.

I think it makes sense to expose there variants to users directly, rather than exposing raw vgetvrm/vsetvrm. I think the risk of confusion is minimal; if the user looks at the riscv_vector.h documentation and sees that the shift intrinsic has a "rounding mode" argument, it should be clear how to use it. The user shouldn't need to care that the bits aren't actually part of the instruction encoding.

That said, if we really want to expose vgetvrm/vsetvrm and implicit rounding modes as C intrinsics, we can do it, I guess. The compiler can save/restore the state if it needs to. It makes the implementation more complicated, and the code less efficient at runtime, though.

jrtc27 added a comment.Dec 9 2021, 2:16 PM

For confusion I meant that doing something other than this patch for vxrm will result in different behaviour between floating-point vector operations (which use the pre-existing scalar float rounding mode that's exposed to C) and fixed-point vector operations (which is what vxrm governs). I believe both should offer the same set of interfaces to users.

craig.topper added a comment.EditedDec 9 2021, 2:27 PM

For confusion I meant that doing something other than this patch for vxrm will result in different behaviour between floating-point vector operations (which use the pre-existing scalar float rounding mode that's exposed to C) and fixed-point vector operations (which is what vxrm governs). I believe both should offer the same set of interfaces to users.

Is your opinion that we should do this patch so that they are the same, or that we should change how FP works too?

The floating point interface requires that you use #pragma STDC FENV_ACCESS if you call fesetround. If we implement this patch, that would still be a difference.

Floating point rounding for vectors is currently messed up because we don't mark the FP instruction has having side effects. And no target has defined how to extend constrained intrinsics to target specific intrinsics.

Changing vxrm for integers is potentially going to be more common than changing FP rounding mode. Halide developers have already raised complaints that they need to change vxrm to use vaadd for halving_add and rounding_halving_add. This gets expensive for code that mixes both operations. https://github.com/riscv/riscv-v-spec/issues/739

I agree with the idea that not providing the raw vgetvxrm/vsetvxrm but providing the variants instead.
I see it brings less issues to programmers, LLVM autovectorizer, Halide backend, and optimization passes than the other way.
And since they are target-specific intrinsics, the difference could be understandable.