This is an archive of the discontinued LLVM Phabricator instance.

[x86] Split MXCSR into two pseudo-registers
Abandoned (Public)

Authored by andrew.w.kaylor on Mar 6 2017, 9:41 AM.

Details

Reviewers

zvi
rnk
efriedma

Summary

Split MXCSR into two pseudo-registers so that the control bits and the status bits can be modeled separately. This register cannot be used as an operand to any instruction so we are free to model it in whatever way is most useful for producing correct code.

This patch only updates the instructions that load and save the entire contents of the register, so both control and status parts are referenced together here. A subsequent patch will update floating point operations to add an implicit use of the control bits and an implicit def of the status bits. This will guarantee that FP instructions are not hoisted above or sunk below the instructions that set the control bits or read the status bits, without making FP operations act as barriers to one another.

I will be posting another patch shortly to update the clang front end to recognize this change in register naming.

Diff Detail

Repository: rL LLVM

Event Timeline

andrew.w.kaylor created this revision. Mar 6 2017, 9:41 AM

andrew.w.kaylor added a child revision: D30662: Update clang filtering for mxcsr. Mar 6 2017, 9:44 AM

craig.topper added a subscriber: craig.topper. Mar 6 2017, 10:09 AM

lgtm with other clang fix

This revision now requires changes to proceed. Mar 6 2017, 2:20 PM

This isn't backward-compatible with existing IR which clobbers mxcsr. You could auto-upgrade, I guess. Alternatively, you could make the status bits a subregister of MXCSR instead of modeling it as two completely separate registers.

A subsequent patch will update floating point operations to add an implicit use of the control bits and an implicit def of the status bits

This seems kind of confusing... strict floating-point ops need to implicitly use and def the status bits, because the new value depends on the previous value. You can think of an FP operation as a logical OR acting on the status register. Many kinds of code motion are legal (e.g. you can reorder FP operations with each other, or hoist them out of loops). But if you omit the use, other optimizations won't work correctly; for example, dead code elimination will eliminate FP operations which have a visible effect on the status register.

Given that, I'm not sure what splitting the status register buys you; I guess it becomes easier to check whether an instruction modifies the control bits?
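
To make the "logical OR" point above concrete, here is a minimal sketch in Python that models the status bits as a sticky bitmask. The names `fp_op` and `run` are hypothetical and exist only for this illustration; the flag constants follow the documented MXCSR bit positions for the invalid (bit 0), divide-by-zero (bit 2) and precision (bit 5) flags. It is a toy model of the semantics described in the comment, not of anything in LLVM.

```python
from functools import reduce
from itertools import permutations

# Sticky exception flags at their MXCSR bit positions.
IE = 1 << 0  # invalid operation
ZE = 1 << 2  # divide by zero
PE = 1 << 5  # precision (inexact)


def fp_op(status, raised):
    """Model one strict FP operation (hypothetical helper).

    The new status depends on the old status, so the operation both
    reads (uses) and writes (defs) the status bits.
    """
    return status | raised


def run(raised_per_op, status=0):
    """Apply a sequence of FP operations to an initial status value."""
    return reduce(fp_op, raised_per_op, status)


# Three operations: inexact, divide-by-zero, inexact.
ops = [PE, ZE, PE]

# OR is commutative and associative, so every ordering of the FP
# operations leaves the same final status.
orders = {run(p) for p in permutations(ops)}

# Deleting the divide-by-zero operation changes the final status, which is
# why an operation with an unused result is still not dead.
without_div = run([PE, PE])
```

Under this model `orders` contains a single value, which matches the claim that FP operations can be reordered with each other, while `without_div` differs from it, which matches the DCE concern.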

In D30661#693773, @efriedma wrote:

This isn't backward-compatible with existing IR which clobbers mxcsr. You could auto-upgrade, I guess. Alternatively, you could make the status bits a subregister of MXCSR instead of modeling it as two completely separate registers.

I didn't think these were supported for modelling inline asm constraints. Besides, the "mxcsr" constraint is less than two months old. Certainly we aren't required to be backwards compatible with it yet.

In D30661#693773, @efriedma wrote:

A subsequent patch will update floating point operations to add an implicit use of the control bits and an implicit def of the status bits

This seems kind of confusing... strict floating-point ops need to implicitly use and def the status bits, because the new value depends on the previous value. You can think of an FP operation as a logical OR acting on the status register. Many kinds of code motion are legal (e.g. you can reorder FP operations with each other, or hoist them out of loops). But if you omit the use, other optimizations won't work correctly; for example, dead code elimination will eliminate FP operations which have a visible effect on the status register.

Given that, I'm not sure what splitting the status register buys you; I guess it becomes easier to check whether an instruction modifies the control bits?

My goal was to restrict the motion of FP operations relative to the instructions that set the control bits or read the status bits without imposing unnecessary restrictions. I believe we agreed when this was discussed previously that the order of exceptions does not need to be preserved, so long as all exceptions are accounted for in the status bits.

I am indeed introducing this register modeling with a view to supporting strict floating point semantics. Initially I intended to model MXCSR use as you indicated, with all strict FP operations having an implicit use and def of this register. However, I was having problems finding a clean way to communicate the fact that an operation required strict semantics across the ISel boundary. This split register modeling was an attempt to get something that was "strict enough" for the strict case without imposing a restriction on the default case.

I suppose you are correct that this has a vulnerability to operations being DCE'd. I'm not sure preserving exception status from floating point operations whose results are never used is a critical use case. I guess that depends on the way we document the semantics of strict FP support. I'll have to think about that.

There is also a question to be resolved as to which function calls should act as barriers and which should not, or in the present context I suppose that equates to which should clobber MXCSR and which should not.

I guess it isn't a backward-compatibility problem if nothing is actually using it yet. Still, it would be nice to make clobbering "mxcsr" do the obvious, correct thing, as opposed to splitting it into registers which don't actually exist.

I didn't think these were supported for modelling inline asm constraints

Support for what, exactly?

I suppose you are correct that this has a vulnerability to operations being DCE'd. I'm not sure preserving exception status from floating point operations whose results are never used is a critical use case. I guess that depends on the way we document the semantics of strict FP support. I'll have to think about that.

It's not just DCE which is problematic... we could also sink a floating-point operation past a read from the status register. I suppose you could prevent that particular problem by making reads from the status register write to the control bits, but that causes its own problems.

In D30661#693861, @efriedma wrote:

It's not just DCE which is problematic... we could also sink a floating-point operation past a read from the status register. I suppose you could prevent that particular problem by making reads from the status register write to the control bits, but that causes its own problems.

Well, that's exactly the sort of thing I am trying to stop. I was under the impression that if all FP operations have an implicit def of the status bits that would be sufficient to prevent them from being sunk past a read of the status bits. Are you saying that if I have two FP operations with defs of 'mxcsr_s' the back end will be free to assume that the first one can sink past the second operation and a subsequent read of the status bits?

Well, that's exactly the sort of thing I am trying to stop. I was under the impression that if all FP operations have an implicit def of the status bits that would be sufficient to prevent them from being sunk past a read of the status bits. Are you saying that if I have two FP operations with defs of 'mxcsr_s' the back end will be free to assume that the first one can sink past the second operation and a subsequent read of the status bits?

Yes. "def" means completely overwriting the old value, so if you have two operations which def a register, the first definition is dead (whether or not the instruction is dead as a whole). You might want to look at how the x86 backend models arithmetic instructions which set EFLAGS to see how this works in practice; the scheduler will, for example, move an ADD across a CMP+CMOV.
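
The dead-def argument above can be sketched with a tiny backward liveness scan. This is a simplified Python model over a straight-line instruction list, assuming each instruction is a `(name, uses, defs)` tuple; the function `dead_defs` and the instruction names are hypothetical and do not correspond to any LLVM API.

```python
def dead_defs(instrs, reg):
    """Return names of instructions whose def of `reg` is never read.

    Each instruction is (name, uses, defs), where uses and defs are sets of
    register names. An instruction reads its uses before writing its defs.
    A def is dead when no later instruction reads `reg` before `reg` is
    overwritten again (or the sequence ends).
    """
    dead = []
    live = False  # is `reg` read after the current point?
    for name, uses, defs in reversed(instrs):
        if reg in defs:
            if not live:
                dead.append(name)
            live = False  # a def kills liveness above this point
        if reg in uses:
            live = True   # a use makes the previous value live
    return list(reversed(dead))


# Modeling proposed in this revision: FP ops only def the status bits.
def_only = [
    ("add1", set(), {"mxcsr_s"}),
    ("add2", set(), {"mxcsr_s"}),
    ("stmxcsr", {"mxcsr_s"}, set()),
]

# Alternative modeling: each FP op both uses and defs the register.
use_and_def = [
    ("add1", {"mxcsr"}, {"mxcsr"}),
    ("add2", {"mxcsr"}, {"mxcsr"}),
    ("stmxcsr", {"mxcsr"}, set()),
]

dead_in_def_only = dead_defs(def_only, "mxcsr_s")
dead_in_use_and_def = dead_defs(use_and_def, "mxcsr")
```

In the def-only sequence the scan reports the def in `add1` as dead, because `add2` overwrites it before anything reads it, so nothing ties `add1` to the later read. In the use-and-def sequence every def is read by the next instruction, so the chain up to `stmxcsr` is preserved.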

OK, so maybe that puts me back in the position of needing to find a way to conditionally add the MXCSR use/def information only when the strict semantics are required, in which case there would be no significant advantage to splitting the register as I'm proposing here.

It looks like I need to rethink this.

Revision Contents

Path	Size
lib/Target/X86/X86InstrFPStack.td	4 lines
lib/Target/X86/X86InstrSSE.td	8 lines
lib/Target/X86/X86RegisterInfo.td	9 lines
test/CodeGen/X86/ipra-reg-usage.ll	2 lines

Diff 90718

lib/Target/X86/X86InstrFPStack.td

 def FPREM : I<0xD9, MRM_F8, (outs), (ins), "fprem", [], IIC_FPREM>;
 def FYL2XP1 : I<0xD9, MRM_F9, (outs), (ins), "fyl2xp1", [], IIC_FYL2XP1>;
 def FSINCOS : I<0xD9, MRM_FB, (outs), (ins), "fsincos", [], IIC_FSINCOS>;
 def FRNDINT : I<0xD9, MRM_FC, (outs), (ins), "frndint", [], IIC_FRNDINT>;
 def FSCALE : I<0xD9, MRM_FD, (outs), (ins), "fscale", [], IIC_FSCALE>;
 def FCOMPP : I<0xDE, MRM_D9, (outs), (ins), "fcompp", [], IIC_FCOMPP>;

 let Predicates = [HasFXSR] in {
-let Uses = [MXCSR] in {
+let Uses = [MXCSR_C, MXCSR_S] in {
 def FXSAVE : I<0xAE, MRM0m, (outs), (ins opaque512mem:$dst),
 "fxsave\t$dst", [(int_x86_fxsave addr:$dst)], IIC_FXSAVE>, TB;
 def FXSAVE64 : RI<0xAE, MRM0m, (outs), (ins opaque512mem:$dst),
 "fxsave64\t$dst", [(int_x86_fxsave64 addr:$dst)],
 IIC_FXSAVE>, TB, Requires<[In64BitMode]>;
 }
-let Defs = [MXCSR] in {
+let Defs = [MXCSR_C, MXCSR_S] in {
 def FXRSTOR : I<0xAE, MRM1m, (outs), (ins opaque512mem:$src),
 "fxrstor\t$src", [(int_x86_fxrstor addr:$src)], IIC_FXRSTOR>,
 TB;
 def FXRSTOR64 : RI<0xAE, MRM1m, (outs), (ins opaque512mem:$src),
 "fxrstor64\t$src", [(int_x86_fxrstor64 addr:$src)],
 IIC_FXRSTOR>, TB, Requires<[In64BitMode]>;
 }
 } // Predicates = [FeatureFXSR]

lib/Target/X86/X86InstrSSE.td

 } // SchedRW

 def : Pat<(X86MFence), (MFENCE)>;

 //===----------------------------------------------------------------------===//
 // SSE 1 & 2 - Load/Store XCSR register
 //===----------------------------------------------------------------------===//

-let Defs = [MXCSR] in
+let Defs = [MXCSR_C, MXCSR_S] in
 def VLDMXCSR : VPSI<0xAE, MRM2m, (outs), (ins i32mem:$src),
 "ldmxcsr\t$src", [(int_x86_sse_ldmxcsr addr:$src)],
 IIC_SSE_LDMXCSR>, VEX, Sched<[WriteLoad]>, VEX_WIG;
-let Uses = [MXCSR] in
+let Uses = [MXCSR_C, MXCSR_S] in
 def VSTMXCSR : VPSI<0xAE, MRM3m, (outs), (ins i32mem:$dst),
 "stmxcsr\t$dst", [(int_x86_sse_stmxcsr addr:$dst)],
 IIC_SSE_STMXCSR>, VEX, Sched<[WriteStore]>, VEX_WIG;

 let Predicates = [UseSSE1] in {
-let Defs = [MXCSR] in
+let Defs = [MXCSR_C, MXCSR_S] in
 def LDMXCSR : I<0xAE, MRM2m, (outs), (ins i32mem:$src),
 "ldmxcsr\t$src", [(int_x86_sse_ldmxcsr addr:$src)],
 IIC_SSE_LDMXCSR>, TB, Sched<[WriteLoad]>;
-let Uses = [MXCSR] in
+let Uses = [MXCSR_C, MXCSR_S] in
 def STMXCSR : I<0xAE, MRM3m, (outs), (ins i32mem:$dst),
 "stmxcsr\t$dst", [(int_x86_sse_stmxcsr addr:$dst)],
 IIC_SSE_STMXCSR>, TB, Sched<[WriteStore]>;
 }

 //===---------------------------------------------------------------------===//
 // SSE2 - Move Aligned/Unaligned Packed Integer Instructions
 //===---------------------------------------------------------------------===//

lib/Target/X86/X86RegisterInfo.td

 // Floating-point status word
 def FPSW : X86Reg<"fpsw", 0>;

 // Status flags register
 def EFLAGS : X86Reg<"flags", 0>;

 // SSE floating point control/status register
-def MXCSR : X86Reg<"mxcsr", 0>;
+// Although MXCSR is actually a single register we model the control bits
+// separately from the status bits in order to avoid unnecessary dependencies.
+def MXCSR_C : X86Reg<"mxcsr_c", 0>;
+def MXCSR_S : X86Reg<"mxcsr_s", 0>;

 // Segment registers
 def CS : X86Reg<"cs", 1>;
 def DS : X86Reg<"ds", 3>;
 def SS : X86Reg<"ss", 2>;
 def ES : X86Reg<"es", 0>;
 def FS : X86Reg<"fs", 4>;
 def GS : X86Reg<"gs", 5>;

[... 224 lines not shown ...]

 def CCR : RegisterClass<"X86", [i32], 32, (add EFLAGS)> {
 let CopyCost = -1; // Don't allow copying of status registers.
 let isAllocatable = 0;
 }
 def FPCCR : RegisterClass<"X86", [i16], 16, (add FPSW)> {
 let CopyCost = -1; // Don't allow copying of status registers.
 let isAllocatable = 0;
 }
+def MXCSCR : RegisterClass<"X86", [i16], 16, (add MXCSR_C, MXCSR_S)> {
+let CopyCost = -1; // Don't allow copying of MXCSR.
+let isAllocatable = 0;
+}

 // AVX-512 vector/mask registers.
 def VR512 : RegisterClass<"X86", [v16f32, v8f64, v64i8, v32i16, v16i32, v8i64],
 512, (sequence "ZMM%u", 0, 31)>;

 // Scalar AVX-512 floating point registers.
 def FR32X : RegisterClass<"X86", [f32], 32, (sequence "XMM%u", 0, 31)>;

test/CodeGen/X86/ipra-reg-usage.ll

 ; RUN: llc -enable-ipra -print-regusage -o /dev/null 2>&1 < %s | FileCheck %s

 target triple = "x86_64-unknown-unknown"
 declare void @bar1()
 define preserve_allcc void @foo()#0 {
-; CHECK: foo Clobbered Registers: CS DS EFLAGS EIP EIZ ES FPSW FS GS IP MXCSR RIP RIZ SS BND0 BND1 BND2 BND3 CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7 CR8 CR9 CR10 CR11 CR12 CR13 CR14 CR15 DR0 DR1 DR2 DR3 DR4 DR5 DR6 DR7 DR8 DR9 DR10 DR11 DR12 DR13 DR14 DR15 FP0 FP1 FP2 FP3 FP4 FP5 FP6 FP7 K0 K1 K2 K3 K4 K5 K6 K7 MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7 R11 ST0 ST1 ST2 ST3 ST4 ST5 ST6 ST7 XMM16 XMM17 XMM18 XMM19 XMM20 XMM21 XMM22 XMM23 XMM24 XMM25 XMM26 XMM27 XMM28 XMM29 XMM30 XMM31 YMM0 YMM1 YMM2 YMM3 YMM4 YMM5 YMM6 YMM7 YMM8 YMM9 YMM10 YMM11 YMM12 YMM13 YMM14 YMM15 YMM16 YMM17 YMM18 YMM19 YMM20 YMM21 YMM22 YMM23 YMM24 YMM25 YMM26 YMM27 YMM28 YMM29 YMM30 YMM31 ZMM0 ZMM1 ZMM2 ZMM3 ZMM4 ZMM5 ZMM6 ZMM7 ZMM8 ZMM9 ZMM10 ZMM11 ZMM12 ZMM13 ZMM14 ZMM15 ZMM16 ZMM17 ZMM18 ZMM19 ZMM20 ZMM21 ZMM22 ZMM23 ZMM24 ZMM25 ZMM26 ZMM27 ZMM28 ZMM29 ZMM30 ZMM31 R11B R11D R11W
+; CHECK: foo Clobbered Registers: CS DS EFLAGS EIP EIZ ES FPSW FS GS IP MXCSR_C MXCSR_S RIP RIZ SS BND0 BND1 BND2 BND3 CR0 CR1 CR2 CR3 CR4 CR5 CR6 CR7 CR8 CR9 CR10 CR11 CR12 CR13 CR14 CR15 DR0 DR1 DR2 DR3 DR4 DR5 DR6 DR7 DR8 DR9 DR10 DR11 DR12 DR13 DR14 DR15 FP0 FP1 FP2 FP3 FP4 FP5 FP6 FP7 K0 K1 K2 K3 K4 K5 K6 K7 MM0 MM1 MM2 MM3 MM4 MM5 MM6 MM7 R11 ST0 ST1 ST2 ST3 ST4 ST5 ST6 ST7 XMM16 XMM17 XMM18 XMM19 XMM20 XMM21 XMM22 XMM23 XMM24 XMM25 XMM26 XMM27 XMM28 XMM29 XMM30 XMM31 YMM0 YMM1 YMM2 YMM3 YMM4 YMM5 YMM6 YMM7 YMM8 YMM9 YMM10 YMM11 YMM12 YMM13 YMM14 YMM15 YMM16 YMM17 YMM18 YMM19 YMM20 YMM21 YMM22 YMM23 YMM24 YMM25 YMM26 YMM27 YMM28 YMM29 YMM30 YMM31 ZMM0 ZMM1 ZMM2 ZMM3 ZMM4 ZMM5 ZMM6 ZMM7 ZMM8 ZMM9 ZMM10 ZMM11 ZMM12 ZMM13 ZMM14 ZMM15 ZMM16 ZMM17 ZMM18 ZMM19 ZMM20 ZMM21 ZMM22 ZMM23 ZMM24 ZMM25 ZMM26 ZMM27 ZMM28 ZMM29 ZMM30 ZMM31 R11B R11D R11W
 call void @bar1()
 call void @bar2()
 ret void
 }
 declare void @bar2()
 attributes #0 = {nounwind}