This is an archive of the discontinued LLVM Phabricator instance.

rogfer01 added inline comments.

llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338	Seeing this next to the `FRM` definitions above makes me curious. Either we don't need this, or we forgot it for the `FRM` ones. It is a bit surprising that we defined `ReadSysReg` / `WriteSysReg` / `WriteSysRegImm` as not having side effects. I wonder what the reason for that is. I'm sure I'm missing something here.

craig.topper added inline comments. Nov 18 2021, 9:28 AM

llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338	FRM is incomplete. None of the FP instructions are annotated with Uses/Defs of FRM. I think we found issues with instruction scheduling for VXRM on our internal branch if we didn't have hasSideEffects here. I'll see if I can find the discussion and if there was a test case added. I suspect FRM also needs it, but hasn't been tested as much.

craig.topper added inline comments. Nov 18 2021, 9:45 AM

llvm/lib/Target/RISCV/RISCVInstrInfo.td
1338	Ok I reviewed the discussion. I think what happens is InstrEmitter.cpp sets the physical register def as Dead because SelectionDAG doesn't see the VXRM instruction glued to any instructions. This allowed MachineCSE to remove or CSE something in a way it shouldn't have.

Ping

Given past experiences I've had with registers like this, I'd recommend not implementing these intrinsics in their current state. Giving the user access to directly mess with registers like this makes IR optimizations harder because we have to model the intrinsics that depend on the register as reading/writing memory. And there are bad interactions between the register state and calls introduced by the compiler: if the compiler introduces a call, that call might not preserve the state.

Instead, I'd suggest modifying the intrinsics for the instructions that depend on this register: make them take the desired rounding mode as an argument. You can then lower the intrinsics in the backend: for each instruction that needs a rounding mode, set the rounding mode before it. The pass that does this lowering can coalesce/hoist the register modifications when appropriate.
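To make the suggestion concrete, here is a minimal sketch of the coalescing step on a toy instruction list. `Inst` and `lowerBlock` are hypothetical names for illustration, not LLVM types or APIs; a real pass would operate on machine instructions and propagate state across basic blocks, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a machine instruction (not an LLVM type).
struct Inst {
  std::string Name;
  // Rounding mode requested by the intrinsic; empty if the instruction
  // does not read vxrm.
  std::optional<uint8_t> RoundMode;
};

// Walk one basic block and emit "csrwi vxrm, N" only when the mode an
// instruction needs differs from the mode known to be in the register.
std::vector<std::string> lowerBlock(const std::vector<Inst> &Block) {
  std::vector<std::string> Out;
  std::optional<uint8_t> Current; // Unknown on block entry.
  for (const Inst &I : Block) {
    if (I.RoundMode && Current != I.RoundMode) {
      Out.push_back("csrwi vxrm, " + std::to_string(*I.RoundMode));
      Current = I.RoundMode;
    }
    Out.push_back(I.Name);
  }
  return Out;
}
```

With this shape, two consecutive `vaadd.vv` instructions that request the same mode share a single write, while an instruction that does not read vxrm leaves the tracked state untouched.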

In D113439#3183627, @efriedma wrote:

Given past experiences I've had with registers like this, I'd recommend not implementing these intrinsics in their current state. Giving the user access to directly mess with registers like this makes IR optimizations harder because we have to model the intrinsics that depend on the register as reading/writing memory. And there are bad interactions between the register state and calls introduced by the compiler: if the compiler introduces a call, that call might not preserve the state.

Instead, I'd suggest modifying the intrinsics for the instructions that depend on this register: make them take the desired rounding mode as an argument. You can then lower the intrinsics in the backend: for each instruction that needs a rounding mode, set the rounding mode before it. The pass that does this lowering can coalesce/hoist the register modifications when appropriate.

Thanks Eli. Do you recommend lowering in SelectionDAG or as a separate pass? Should we restore the original value after the call? Are there examples in tree?

Maybe I should also mention that my immediate goal was to enable Halide to implement halving_add on RISCV vectors. They were going to use these intrinsics and existing RVV intrinsics that are marked as having side effects. I had wondered if it would make sense to have generic IR intrinsics for halving add. I could lower that to a vxrm change plus the vaadd instruction.

If you want an example of something broken because we didn't do it correctly, look at the ARM intrinsics involving the saturation bit. For an example of something that works... maybe the x86 "DF" flag?

Probably best to do the lowering after isel: we don't have any infrastructure for hoisting/coalescing the instructions that modify the register (unless you want to try to make the register allocator handle it).

Maybe look at X86VZeroUpper.cpp for inspiration; not exactly the same thing, but a similar algorithm should work.

How you deal with calls depends on the calling convention you want to use. There are basically three possibilities:

We require that the register is zero (or some other fixed value) on entry to/return from a call.
The register is volatile across calls.
The register is preserved by calls.

It looks like the current ABI documentation says it has "thread storage duration", but that's not really a great idea; we don't want to expose this in the first place. Hopefully this isn't set in stone? If necessary, we could use some hack like preserving the incoming value, but assuming outgoing calls clobber it, I guess.

If you're using a state machine like X86VZeroUpper, it should be easy to handle in any case.
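As a rough illustration of why the state-machine approach copes with any of the three conventions, the sketch below models only the transfer function at a call site. `CallPolicy`, `stateAfterCall`, `needsWriteBeforeCall`, and the default `FixedMode` of 0 are all assumptions made for this example; none of them come from the ABI document or from X86VZeroUpper.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// The three possibilities listed above, as a hypothetical enum.
enum class CallPolicy { FixedValue, Clobbered, Preserved };

// What the pass knows about vxrm after a call, given what it knew before.
// An empty optional means "unknown, must rewrite before the next use".
std::optional<uint8_t> stateAfterCall(CallPolicy P,
                                      std::optional<uint8_t> Before,
                                      uint8_t FixedMode = 0) {
  switch (P) {
  case CallPolicy::FixedValue:
    return FixedMode; // Callee returns with the agreed value.
  case CallPolicy::Clobbered:
    return std::nullopt; // Callee may have changed it.
  case CallPolicy::Preserved:
    return Before; // Callee restores whatever was there.
  }
  return std::nullopt;
}

// Only the fixed-value convention forces a write before the call, and only
// when the register is not already known to hold the agreed value.
bool needsWriteBeforeCall(CallPolicy P, std::optional<uint8_t> Before,
                          uint8_t FixedMode = 0) {
  return P == CallPolicy::FixedValue && Before != FixedMode;
}
```

Switching conventions then only changes this one function; the rest of the pass keeps tracking a known-or-unknown mode per program point.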

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

In D113439#3183639, @craig.topper wrote:

Maybe I should also mention that my immediate goal was to enable Halide to implement halving_add on RISCV vectors. They were going to use these intrinsics and existing RVV intrinsics that are marked as having side effects.

This is part of why we want the compiler to control the rounding/saturation state, not the user. You don't want halving_add to have side-effects. :)

I had wondered if it would make sense to have generic IR intrinsics for halving add. I could lower that to a vxrm change plus the vaadd instruction.

You mean target-independent? This is orthogonal to the question of what the riscv intrinsics should look like. But it would make sense; it's a common operation in vector instruction sets.

In D113439#3183745, @jrtc27 wrote:

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

My reading is that "thread storage duration" is supposed to mean that if the user sets it in one function, and retrieves it later in some other function, it should have the same value. This matches the corresponding language for the floating-point state register. That implies the sort of mess we have with floating-point rounding modes and error flags; that's not something we want to emulate in other contexts.

If we're treating it as a normal register, I would expect language like "this register is preserved across calls" or "this register isn't preserved across calls" or something like that.

In D113439#3183818, @efriedma wrote:

In D113439#3183745, @jrtc27 wrote:

Even if you don't expose it in C the ABI still needs to specify what happens across call boundaries? I don't see what issues thread storage duration poses, as everything it implies is either a hard requirement (other threads don't clobber it) or for efficiency reasons (avoiding saving and restoring in any function that clobbers it)? It's a bit strange language to use though I'll admit given thread storage is a C language concept.

My reading is that "thread storage duration" is supposed to mean that if the user sets it in one function, and retrieves it later in some other function, it should have the same value. This matches the corresponding language for the floating-point state register. That implies the sort of mess we have with floating-point rounding modes and error flags; that's not something we want to emulate in other contexts.

If we're treating it as a normal register, I would expect language like "this register is preserved across calls" or "this register isn't preserved across calls" or something like that.

I see; I thought we did say it was call-clobbered, but we don't, because the equivalent fcsr isn't. The intent behind the current wording was to exactly follow what's done for floating-point, for minimum surprise and disruption; even if the floating-point environment is a bad idea for efficient implementation, I'm not sure doing something totally different for vectors is a good idea. Bear in mind that, for floating-point vector operations, the rounding mode is still what's used for scalar floats; vxrm only affects fixed-point operations.

I think at the IR level, we need the intrinsic variants that have a rounding mode argument if we want to allow the compiler to ever generate the relevant instructions for autovectorization etc. The autovectorizer can't use vgetvxrm/vsetvxrm correctly and efficiently; it can't tell, for example, where the backend will insert runtime calls.

I think it makes sense to expose these variants to users directly, rather than exposing raw vgetvxrm/vsetvxrm. I think the risk of confusion is minimal; if the user looks at the riscv_vector.h documentation and sees that the shift intrinsic has a "rounding mode" argument, it should be clear how to use it. The user shouldn't need to care that the bits aren't actually part of the instruction encoding.

That said, if we really want to expose vgetvxrm/vsetvxrm and implicit rounding modes as C intrinsics, we can do it, I guess. The compiler can save/restore the state if it needs to. It makes the implementation more complicated, and the code less efficient at runtime, though.

For confusion I meant that doing something other than this patch for vxrm will result in different behaviour between floating-point vector operations (which use the pre-existing scalar float rounding mode that's exposed to C) and fixed-point vector operations (which is what vxrm governs). I believe both should offer the same set of interfaces to users.

In D113439#3183941, @jrtc27 wrote:

For confusion I meant that doing something other than this patch for vxrm will result in different behaviour between floating-point vector operations (which use the pre-existing scalar float rounding mode that's exposed to C) and fixed-point vector operations (which is what vxrm governs). I believe both should offer the same set of interfaces to users.

Is your opinion that we should do this patch so that they are the same, or that we should change how FP works too?

The floating point interface requires that you use #pragma STDC FENV_ACCESS if you call fesetround. If we implement this patch, that would still be a difference.
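For comparison, this is roughly what the floating-point side looks like from user code. It is a sketch only: `divideRoundedUp` is a made-up helper, and not every compiler honors the pragma (some ignore it with a warning), so the result should not be relied on without checking the toolchain.

```cpp
#include <cassert>
#include <cfenv>

// Tells the compiler that the code below reads or changes the FP environment.
#pragma STDC FENV_ACCESS ON

// Hypothetical helper: divide with the rounding mode temporarily set to
// round-toward-positive-infinity, then restore the caller's mode.
double divideRoundedUp(double A, double B) {
  const int Saved = std::fegetround();
  std::fesetround(FE_UPWARD);
  // volatile keeps the operands from being constant-folded at compile time.
  volatile double X = A, Y = B;
  double R = X / Y;
  std::fesetround(Saved);
  return R;
}
```

The save/restore pair around the operation is the part that has no counterpart in the vxrm intrinsics as proposed in this patch: nothing there tells the compiler that the surrounding code depends on the register.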

Floating point rounding for vectors is currently messed up because we don't mark the FP instructions as having side effects. And no target has defined how to extend constrained intrinsics to target specific intrinsics.

Changing vxrm for integers is potentially going to be more common than changing FP rounding mode. Halide developers have already raised complaints that they need to change vxrm to use vaadd for halving_add and rounding_halving_add. This gets expensive for code that mixes both operations. https://github.com/riscv/riscv-v-spec/issues/739
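To show why the two Halide operations need different vxrm values, here is a scalar model of one 8-bit lane of the unsigned averaging add, following my reading of the rounding increments in the RVV spec. The function names are invented for this sketch, and the mapping of `halving_add` to round-down and `rounding_halving_add` to round-to-nearest-up is my understanding of the Halide semantics rather than something stated in this thread.

```cpp
#include <cassert>
#include <cstdint>

// vxrm encodings: round-to-nearest-up, round-to-nearest-even,
// round-down (truncate), round-to-odd.
enum class VXRM : uint8_t { RNU = 0, RNE = 1, RDN = 2, ROD = 3 };

// Scalar model of one 8-bit lane of an unsigned averaging add:
// (A + B) >> 1 plus a rounding increment selected by the mode.
uint8_t vaaddu8(uint8_t A, uint8_t B, VXRM Mode) {
  unsigned Sum = unsigned(A) + unsigned(B); // 9 bits, cannot overflow.
  unsigned Lsb = Sum & 1;                   // The bit that is shifted out.
  unsigned Next = (Sum >> 1) & 1;           // Low bit of the truncated result.
  unsigned R = 0;
  switch (Mode) {
  case VXRM::RNU: R = Lsb; break;              // Ties round up.
  case VXRM::RNE: R = Lsb & Next; break;       // Ties round to even.
  case VXRM::RDN: R = 0; break;                // Always truncate.
  case VXRM::ROD: R = Lsb & (Next ^ 1); break; // Inexact results become odd.
  }
  return uint8_t((Sum >> 1) + R);
}

// Assumed mapping of the two Halide operations onto rounding modes.
uint8_t halvingAdd(uint8_t A, uint8_t B) { return vaaddu8(A, B, VXRM::RDN); }
uint8_t roundingHalvingAdd(uint8_t A, uint8_t B) {
  return vaaddu8(A, B, VXRM::RNU);
}
```

Code that alternates between the two helpers would need a vxrm write between every pair of operations if the mode is implicit state, which is the cost the linked issue complains about.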

I agree with the idea of not providing the raw vgetvxrm/vsetvxrm and providing the variants instead.
I see it causing fewer issues for programmers, the LLVM autovectorizer, the Halide backend, and optimization passes than the other way.
And since they are target-specific intrinsics, the difference should be understandable.

arcbbb mentioned this in D121376: [RISCV][RVV] Introduce roundmode operand to PseudoVAADD instruction. Mar 10 2022, 7:20 AM

Revision Contents

Path                                            Size
llvm/include/llvm/IR/IntrinsicsRISCV.td         6 lines
llvm/lib/Target/RISCV/RISCVISelLowering.cpp     16 lines
llvm/lib/Target/RISCV/RISCVInstrInfo.td         6 lines
llvm/lib/Target/RISCV/RISCVSystemOperands.td    2 lines
llvm/test/CodeGen/RISCV/rvv/vxrm-access.ll      38 lines

Diff 385649

llvm/include/llvm/IR/IntrinsicsRISCV.td

  def int_riscv_vsetvli : Intrinsic<[llvm_anyint_ty],
  ...
                                    ImmArg<ArgIndex<2>>]>;
  def int_riscv_vsetvlimax : Intrinsic<[llvm_anyint_ty],
                             /* VSEW */ [LLVMMatchType<0>,
                             /* VLMUL */ LLVMMatchType<0>],
                                        [IntrNoMem, IntrHasSideEffects,
                                         ImmArg<ArgIndex<0>>,
                                         ImmArg<ArgIndex<1>>]>;

+ def int_riscv_vsetvxrm : Intrinsic<[],
+                                    [llvm_anyint_ty],
+                                    [IntrNoMem, IntrHasSideEffects]>;
+ def int_riscv_vgetvxrm : Intrinsic<[llvm_anyint_ty],
+                                    [],
+                                    [IntrNoMem, IntrHasSideEffects]>;
  // For unit stride load
  // Input: (pointer, vl)
  class RISCVUSLoad
        : Intrinsic<[llvm_anyvector_ty],
                    [LLVMPointerType<LLVMMatchType<0>>,
                     llvm_anyint_ty],
                    [NoCapture<ArgIndex<0>>, IntrReadMem]>, RISCVVIntrinsic;
  // For unit stride fault-only-first load
  ...

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

  //===-- RISCVISelLowering.cpp - RISCV DAG Lowering Implementation --------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//
  //
  ...
  }

  SDValue RISCVTargetLowering::LowerINTRINSIC_W_CHAIN(SDValue Op,
                                                      SelectionDAG &DAG) const {
    unsigned IntNo = Op.getConstantOperandVal(1);
    switch (IntNo) {
    default:
      break;
+   case Intrinsic::riscv_vgetvxrm: {
+     SDLoc DL(Op);
+     SDValue SysRegNo =
+         DAG.getTargetConstant(RISCVSysReg::lookupSysRegByName("VXRM")->Encoding,
+                               DL, Subtarget.getXLenVT());
+     return DAG.getNode(RISCVISD::READ_CSR, DL, Op->getVTList(),
+                        Op.getOperand(0), SysRegNo);
+   }
    case Intrinsic::riscv_masked_strided_load: {
      SDLoc DL(Op);
      MVT XLenVT = Subtarget.getXLenVT();

      // If the mask is known to be all ones, optimize to an unmasked intrinsic;
      // the selection of the masked intrinsics doesn't do this for us.
      SDValue Mask = Op.getOperand(5);
      bool IsUnmasked = ISD::isConstantSplatVectorAllOnes(Mask.getNode());
  ...
  }

  SDValue RISCVTargetLowering::LowerINTRINSIC_VOID(SDValue Op,
                                                   SelectionDAG &DAG) const {
    unsigned IntNo = Op.getConstantOperandVal(1);
    switch (IntNo) {
    default:
      break;
+   case Intrinsic::riscv_vsetvxrm: {
+     SDLoc DL(Op);
+     SDValue SysRegNo =
+         DAG.getTargetConstant(RISCVSysReg::lookupSysRegByName("VXRM")->Encoding,
+                               DL, Subtarget.getXLenVT());
+     return DAG.getNode(RISCVISD::WRITE_CSR, DL, MVT::Other, Op.getOperand(0),
+                        SysRegNo, Op.getOperand(2));
+   }
    case Intrinsic::riscv_masked_strided_store: {
      SDLoc DL(Op);
      MVT XLenVT = Subtarget.getXLenVT();

      // If the mask is known to be all ones, optimize to an unmasked intrinsic;
      // the selection of the masked intrinsics doesn't do this for us.
      SDValue Mask = Op.getOperand(5);
      bool IsUnmasked = ISD::isConstantSplatVectorAllOnes(Mask.getNode());
  ...

llvm/lib/Target/RISCV/RISCVInstrInfo.td

  class SwapSysRegImm<SysReg SR, list<Register> Regs>
  ...
    let Uses = Regs;
    let Defs = Regs;
  }

  def ReadFRM : ReadSysReg<SysRegFRM, [FRM]>;
  def WriteFRM : WriteSysReg<SysRegFRM, [FRM]>;
  def WriteFRMImm : WriteSysRegImm<SysRegFRM, [FRM]>;

+ let hasSideEffects = true in {
  [Inline comments on this line (1338) from rogfer01 and craig.topper appear in the discussion above.]
+ def ReadVXRM : ReadSysReg<SysRegVXRM, [VXRM]>;
+ def WriteVXRM: WriteSysReg<SysRegVXRM, [VXRM]>;
+ def WriteVXRMImm: WriteSysRegImm<SysRegVXRM, [VXRM]>;
+ }

  /// Other pseudo-instructions

  // Pessimistically assume the stack pointer will be clobbered
  let Defs = [X2], Uses = [X2] in {
  def ADJCALLSTACKDOWN : Pseudo<(outs), (ins i32imm:$amt1, i32imm:$amt2),
                                [(callseq_start timm:$amt1, timm:$amt2)]>;
  def ADJCALLSTACKUP   : Pseudo<(outs), (ins i32imm:$amt1, i32imm:$amt2),
                                [(callseq_end timm:$amt1, timm:$amt2)]>;
  ...

llvm/lib/Target/RISCV/RISCVSystemOperands.td

  ...
  def : SysReg<"dscratch0", 0x7B2>;
  def : SysReg<"dscratch1", 0x7B3>;

  //===-----------------------------------------------
  // User Vector CSRs
  //===-----------------------------------------------
  def : SysReg<"vstart", 0x008>;
  def : SysReg<"vxsat", 0x009>;
- def : SysReg<"vxrm", 0x00A>;
+ def SysRegVXRM : SysReg<"vxrm", 0x00A>;
  def : SysReg<"vcsr", 0x00F>;
  def : SysReg<"vl", 0xC20>;
  def : SysReg<"vtype", 0xC21>;
  def : SysReg<"vlenb", 0xC22>;

llvm/test/CodeGen/RISCV/rvv/vxrm-access.ll

This file was added.

+ ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+ ; RUN: sed 's/iXLen/i32/g' %s | llc -mtriple=riscv32 -mattr=+experimental-v \
+ ; RUN:   -verify-machineinstrs | FileCheck --check-prefix=CHECK %s
+ ; RUN: sed 's/iXLen/i64/g' %s | llc -mtriple=riscv64 -mattr=+experimental-v \
+ ; RUN:   -verify-machineinstrs | FileCheck --check-prefix=CHECK %s
+
+ declare iXLen @llvm.riscv.vgetvxrm();
+ declare void @llvm.riscv.vsetvxrm.iXLen(iXLen %roundmode);
+
+ define iXLen @readvxrm() nounwind
+ ; CHECK-LABEL: readvxrm:
+ ; CHECK:       # %bb.0:
+ ; CHECK-NEXT:    csrr a0, vxrm
+ ; CHECK-NEXT:    ret
+ {
+   %ret = call iXLen @llvm.riscv.vgetvxrm()
+   ret iXLen %ret
+ }
+
+ define void @writevxrm(iXLen %roundmode) nounwind
+ ; CHECK-LABEL: writevxrm:
+ ; CHECK:       # %bb.0:
+ ; CHECK-NEXT:    csrw vxrm, a0
+ ; CHECK-NEXT:    ret
+ {
+   call void @llvm.riscv.vsetvxrm.iXLen(iXLen %roundmode)
+   ret void
+ }
+
+ define void @writevxrmimm() nounwind
+ ; CHECK-LABEL: writevxrmimm:
+ ; CHECK:       # %bb.0:
+ ; CHECK-NEXT:    csrwi vxrm, 3
+ ; CHECK-NEXT:    ret
+ {
+   call void @llvm.riscv.vsetvxrm.iXLen(iXLen 3)
+   ret void
+ }


[RISCV] Add IR intrinsics for reading/write vxrm.
Needs Review, Public


Unit Tests: Failed
