This is an archive of the discontinued LLVM Phabricator instance.

[Strict FP] Allow more relaxed scheduling
Closed · Public

Authored by uweigand on Jul 9 2019, 6:15 AM.


Details

Reviewers

andrew.w.kaylor
cameron.mcinally
kpn
kbarton
hfinkel

Commits

rG450c62e33ea5: [Strict FP] Allow more relaxed scheduling
rL366222: [Strict FP] Allow more relaxed scheduling

Summary

Support for strict floating-point instructions at the DAG/MI level, as recently introduced in https://reviews.llvm.org/D55506, constrains instruction scheduling for such instructions to enforce their original source order. While this mirrors the current requirements on strict FP intrinsics at the LLVM IR level, I believe this is really more strict than would be required to implement the semantics of strict FP.

Specifically, I believe it should be allowed to move one strict FP instruction across another, as long as it is not moved across any global barrier. If both instructions were to raise a trapping FP exception, this means that you may now see the other of those exceptions first, but that should still be OK.

This patch provides an alternative implementation in ScheduleDAGInstrs::buildSchedGraph that implements this relaxed constraint. This means that instruction scheduling for strict FP instructions is now nearly as flexible as for standard FP instructions, removing a bit of the extra performance overhead.
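The difference between the two ordering rules can be sketched with a toy model. This is only an illustration under simplifying assumptions: the names Kind, Edge, chainedEdges, relaxedEdges and mustPrecede are hypothetical and not part of LLVM, the model walks the block top-down, and it ignores register and memory dependencies entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical instruction kinds for this toy model; these are not LLVM types.
enum class Kind { Other, StrictFP, Barrier };

// An edge (first, second) means "first must stay before second".
using Edge = std::pair<std::size_t, std::size_t>;

// Rule before the patch: strict FP instructions and barriers form a single
// chain, so their relative source order is fully preserved.
std::set<Edge> chainedEdges(const std::vector<Kind> &block) {
  std::set<Edge> edges;
  bool havePrev = false;
  std::size_t prev = 0;
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i] == Kind::Other)
      continue;
    if (havePrev)
      edges.insert({prev, i});
    prev = i;
    havePrev = true;
  }
  return edges;
}

// Relaxed rule: strict FP instructions are ordered only against barriers,
// never directly against each other.
std::set<Edge> relaxedEdges(const std::vector<Kind> &block) {
  std::set<Edge> edges;
  bool haveBarrier = false;
  std::size_t barrier = 0;
  std::vector<std::size_t> pendingFP; // strict FP ops since the last barrier
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i] == Kind::Barrier) {
      for (std::size_t fp : pendingFP)
        edges.insert({fp, i}); // earlier FP ops stay above this barrier
      pendingFP.clear();
      if (haveBarrier)
        edges.insert({barrier, i}); // barriers stay ordered among themselves
      barrier = i;
      haveBarrier = true;
    } else if (block[i] == Kind::StrictFP) {
      if (haveBarrier)
        edges.insert({barrier, i}); // FP op stays below the last barrier
      pendingFP.push_back(i);
    }
  }
  return edges;
}

// Returns true if the edge set directly orders instruction a before b.
bool mustPrecede(const std::set<Edge> &edges, std::size_t a, std::size_t b) {
  assert(a < b && "expects source-order indices");
  return edges.count(Edge{a, b}) != 0;
}
```

For a block of two strict FP instructions, a barrier, and a third strict FP instruction, the chained rule orders the first two FP instructions directly, while the relaxed rule only ties each of them to the barrier.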

Diff Detail

Event Timeline

uweigand created this revision. Jul 9 2019, 6:15 AM

Herald added a project: Restricted Project. Jul 9 2019, 6:15 AM

Herald added subscribers: llvm-commits, MatzeB.

jsji added a subscriber: wuzish. Jul 9 2019, 6:40 AM

Stupid question: what's a "global barrier"?

This isn't going to be moving instructions outside of tests, right? For example:

int foo(double d) {
  return (isnan(d) ? 0 : (int)d);
}

A year ago this caused traps because of speculative execution causing both legs of the ternary operator to be executed. I feel a little silly asking, but ... I'm asking anyway.

Also, how does this patch interact with volatile accesses? We use volatile to finesse compilers into doing what we need.

This scheduler generally only operates on single basic blocks, so it would not move anything outside of a test.

It also will not move anything across a volatile memory access, since that is one of the global barriers. (Those are: calls, instructions with "unmodeled side effects", and volatile/atomic memory accesses.)

Ping? It would be good to get this in LLVM 9 ...

I'm reviewing the trap-safety issues now and have some open questions (language lawyers needed??). Let's say we have something like this:

z = strict_fmul x, y
c = strict_fmul a, b
store z
store c

And we schedule it as:

z = strict_fmul x, y
store z
c = strict_fmul a, b
store c

Now let's say the 2nd fmul traps on overflow and we have a signal handler set up to gracefully recover. The differences in scheduling could mean memory differences when the signal handler executes. That might not be okay for the very strictest conformance mode (I don't know).

Are we treating this case as undefined behavior, like the C/C++ Standards dictate?

Bah, my last comment was flawed! I read the test cases incorrectly and missed the 'fpexcept.ignore' on some of them.

But I think the question is still partially valid. What defines sequencing traps and stores? Are we (LLVM) defining something more strict than IEEE-754 and the C/C++ Standards?

My understanding is that IEEE-754 requires us to produce an exception if and only if the exception would be produced by a literal interpretation of the source. However, it does not require that the exceptions be raised in the same order as implied by the source. Also, that's what the LLVM language reference says we'll do with "fpexcept.strict" -- "The number and order of floating-point exceptions is NOT guaranteed." So, I think the changes you've got here are correct.
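As a rough illustration of why the order of raising flags is not observable through the sticky status flags, here is a small sketch. It assumes traps are masked (the usual default), that the platform defines FE_OVERFLOW and FE_DIVBYZERO, and that the compiler honors the floating-point environment for this code; the function name flagsAfter and the globals are hypothetical.

```cpp
#include <cassert>
#include <cfenv>

// Volatile operands keep the compiler from folding the operations away at
// compile time, so the flags are raised at run time.
volatile double big = 1e308, ten = 10.0, one = 1.0, zero = 0.0;
volatile double sink;

// Runs an overflowing multiply and a division by zero in either order and
// returns which of the two flags ended up set.
int flagsAfter(bool swapOrder) {
  int rc = std::feclearexcept(FE_ALL_EXCEPT);
  assert(rc == 0 && "could not clear the FP exception flags");
  (void)rc;
  if (swapOrder) {
    sink = one / zero; // raises FE_DIVBYZERO
    sink = big * ten;  // raises FE_OVERFLOW (and FE_INEXACT)
  } else {
    sink = big * ten;
    sink = one / zero;
  }
  return std::fetestexcept(FE_OVERFLOW | FE_DIVBYZERO);
}
```

With masked exceptions, both orders leave the same set of flags behind. The difference only becomes visible when traps are enabled, which is the case the rest of this discussion is about.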

I think I'll have to challenge you a little here. ;)

In D64412#1586343, @andrew.w.kaylor wrote:

My understanding is that IEEE-754 requires us to produce an exception if and only if the exception would be produced by a literal interpretation of the source.

The literal interpretation language refers to value-changing optimizations. I don't think it specifies memory ordering though. I could be wrong...

However, it does not require that the exceptions be raised in the same order as implied by the source. Also, that's what the LLVM language reference says we'll do with "fpexcept.strict" -- "The number and order of floating-point exceptions is NOT guaranteed." So, I think the changes you've got here are correct.

That's actually slightly different than what I'm asking -- that's about ordering two trapping operations (I'm not sure where that's specified as ok either), not ordering one trapping operation and a store.

In other words, what happens if we move stores around operations that can trap? I could easily write a small program to give different results based on whether this scheduling change is active or not. Is there somewhere that says different results are ok with same source?

I'm assuming we're treating it as undefined behavior, like the C/C++ Standards state, so that all this doesn't matter. Just want to confirm that we're not mistakenly throwing away strictness.

Just to clarify one thing: even the current implementation, before this patch, does not guarantee the relative order of FP instructions and memory instructions is unchanged. So even the current implementation may perform the reschedule your comment mentions. This patch would add the additional option of also changing the relative order of the two strict_fmul operations.

I do not think there is much point in attempting to guarantee the relative order of FP vs. memory instructions, since those memory instructions are themselves not guaranteed (the C/C++ standard allows memory accesses to be rather freely rescheduled, or even fully omitted).

If relative order of FP vs. memory instructions is an issue to your application, you'll have to use volatile (or atomic) memory accesses; in that case, both the current implementation and my patch will respect the ordering.
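A minimal sketch of that suggestion, mirroring the strict_fmul example from earlier in the thread. The names outZ, outC and multiplyAndPublish are hypothetical, and the ordering guarantee only applies if the translation unit is actually compiled with strict FP semantics, which is an assumption here and depends on the compiler.

```cpp
// Hypothetical output locations. Declaring them volatile turns each store
// into a global barrier for the scheduler, as described above.
volatile double outZ, outC;

void multiplyAndPublish(double x, double y, double a, double b) {
  double z = x * y; // first multiply, may raise an FP exception
  outZ = z;         // volatile store: the second multiply must stay below it
  double c = a * b; // second multiply, may raise an FP exception
  outC = c;         // volatile store
}
```

If the second multiply traps, a signal handler would then be expected to observe the store to outZ as already done. With plain (non-volatile) stores no such expectation holds, before or after this patch.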

In D64412#1587016, @uweigand wrote:

This patch would add the additional option of also changing the relative order of the two strict_fmul operations.

I now see that the IEEE-754 Standard allows for expression transformations that change the order of setting flags, so that should be fine for statements too.

The following value-changing transformations, among others, preserve the literal meaning of the source code:
<...snip...>
― Changing the order in which different flags are raised.

I do not think there is much point in attempting to guarantee the relative order of FP vs. memory instructions, since those memory instructions are themselves not guaranteed (the C/C++ standard allows memory accesses to be rather freely rescheduled, or even fully omitted).

If relative order of FP vs. memory instructions is an issue to your application, you'll have to use volatile (or atomic) memory accesses; in that case, both the current implementation and my patch will respect the ordering.

That's fair. I don't see anything explicitly disallowing it, so can't argue.

This revision is now accepted and ready to land. Jul 16 2019, 7:31 AM

Closed by commit rL366222: [Strict FP] Allow more relaxed scheduling (authored by uweigand). Jul 16 2019, 8:59 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path (Size)

lib/CodeGen/ScheduleDAGInstrs.cpp (31 lines)
test/CodeGen/SystemZ/fp-strict-alias.ll (222 lines)
test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll (30 lines)

Diff 208659

lib/CodeGen/ScheduleDAGInstrs.cpp

void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
LiveIntervals *LIS,		LiveIntervals *LIS,
bool TrackLaneMasks) {		bool TrackLaneMasks) {
const TargetSubtargetInfo &ST = MF.getSubtarget();		const TargetSubtargetInfo &ST = MF.getSubtarget();
bool UseAA = EnableAASchedMI.getNumOccurrences() > 0 ? EnableAASchedMI		bool UseAA = EnableAASchedMI.getNumOccurrences() > 0 ? EnableAASchedMI
: ST.useAA();		: ST.useAA();
AAForDep = UseAA ? AA : nullptr;		AAForDep = UseAA ? AA : nullptr;

BarrierChain = nullptr;		BarrierChain = nullptr;
SUnit *FPBarrierChain = nullptr;

this->TrackLaneMasks = TrackLaneMasks;		this->TrackLaneMasks = TrackLaneMasks;
MISUnitMap.clear();		MISUnitMap.clear();
ScheduleDAG::clearDAG();		ScheduleDAG::clearDAG();

// Create an SUnit for each real instruction.		// Create an SUnit for each real instruction.
initSUnits();		initSUnits();

<...snip...>
// or Loads, and have therefore their own 'NonAlias'		// or Loads, and have therefore their own 'NonAlias'
// domain. E.g. spill / reload instructions never alias LLVM I/R		// domain. E.g. spill / reload instructions never alias LLVM I/R
// Values. It would be nice to assume that this type of memory		// Values. It would be nice to assume that this type of memory
// accesses always have a proper memory operand modelling, and are		// accesses always have a proper memory operand modelling, and are
// therefore never unanalyzable, but this is conservatively not		// therefore never unanalyzable, but this is conservatively not
// done.		// done.
Value2SUsMap NonAliasStores, NonAliasLoads(1 /*TrueMemOrderLatency*/);		Value2SUsMap NonAliasStores, NonAliasLoads(1 /*TrueMemOrderLatency*/);

		// Track all instructions that may raise floating-point exceptions.
		// These do not depend on one another (or normal loads or stores), but
		// must not be rescheduled across global barriers. Note that we don't
		// really need a "map" here since we don't track those MIs by value;
		// using the same Value2SUsMap data type here is simply a matter of
		// convenience.
		Value2SUsMap FPExceptions;

// Remove any stale debug info; sometimes BuildSchedGraph is called again		// Remove any stale debug info; sometimes BuildSchedGraph is called again
// without emitting the info from the previous call.		// without emitting the info from the previous call.
DbgValues.clear();		DbgValues.clear();
FirstDbgValue = nullptr;		FirstDbgValue = nullptr;

assert(Defs.empty() && Uses.empty() &&		assert(Defs.empty() && Uses.empty() &&
"Only BuildGraph should update Defs/Uses");		"Only BuildGraph should update Defs/Uses");
Defs.setUniverse(TRI->getNumRegs());		Defs.setUniverse(TRI->getNumRegs());
<...snip...>
if (isGlobalMemoryObject(AA, &MI)) {
LLVM_DEBUG(dbgs() << "Global memory object and new barrier chain: SU("		LLVM_DEBUG(dbgs() << "Global memory object and new barrier chain: SU("
<< BarrierChain->NodeNum << ").\n";);		<< BarrierChain->NodeNum << ").\n";);

// Add dependencies against everything below it and clear maps.		// Add dependencies against everything below it and clear maps.
addBarrierChain(Stores);		addBarrierChain(Stores);
addBarrierChain(Loads);		addBarrierChain(Loads);
addBarrierChain(NonAliasStores);		addBarrierChain(NonAliasStores);
addBarrierChain(NonAliasLoads);		addBarrierChain(NonAliasLoads);
		addBarrierChain(FPExceptions);
// Add dependency against previous FP barrier and reset FP barrier.
if (FPBarrierChain)
FPBarrierChain->addPredBarrier(BarrierChain);
FPBarrierChain = BarrierChain;

continue;		continue;
}		}

// Instructions that may raise FP exceptions depend on each other.		// Instructions that may raise FP exceptions may not be moved
		// across any global barriers.
if (MI.mayRaiseFPException()) {		if (MI.mayRaiseFPException()) {
if (FPBarrierChain)		if (BarrierChain)
FPBarrierChain->addPredBarrier(SU);		BarrierChain->addPredBarrier(SU);
FPBarrierChain = SU;
		FPExceptions.insert(SU, UnknownValue);

		if (FPExceptions.size() >= HugeRegion) {
		LLVM_DEBUG(dbgs() << "Reducing FPExceptions map.\n";);
		Value2SUsMap empty;
		reduceHugeMemNodeMaps(FPExceptions, empty, getReductionSize());
		}
}		}

// If it's not a store or a variant load, we're done.		// If it's not a store or a variant load, we're done.
if (!MI.mayStore() &&		if (!MI.mayStore() &&
!(MI.mayLoad() && !MI.isDereferenceableInvariantLoad(AA)))		!(MI.mayLoad() && !MI.isDereferenceableInvariantLoad(AA)))
continue;		continue;

// Always add dependecy edge to BarrierChain if present.		// Always add dependecy edge to BarrierChain if present.

test/CodeGen/SystemZ/fp-strict-alias.ll

	; Verify that strict FP operations are not rescheduled			; Verify that strict FP operations are not rescheduled
	;			;
	; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s			; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s

	declare float @llvm.experimental.constrained.fadd.f32(float, float, metadata, metadata)
	declare float @llvm.experimental.constrained.fsub.f32(float, float, metadata, metadata)
	declare float @llvm.experimental.constrained.sqrt.f32(float, metadata, metadata)			declare float @llvm.experimental.constrained.sqrt.f32(float, metadata, metadata)
	declare float @llvm.sqrt.f32(float)			declare float @llvm.sqrt.f32(float)
	declare void @llvm.s390.sfpc(i32)			declare void @llvm.s390.sfpc(i32)

	; For non-strict operations, we expect the post-RA scheduler to			; The basic assumption of all following tests is that on z13, we never
	; separate the two square root instructions on z13.			; want to see two square root instructions directly in a row, so the
	define void @f1(float %f1, float %f2, float %f3, float %f4, float *%ptr0) {			; post-RA scheduler will always schedule something else in between
				; whenever possible.

				; We can move any FP operation across a (normal) store.

				define void @f1(float %f1, float %f2, float *%ptr1, float *%ptr2) {
	; CHECK-LABEL: f1:			; CHECK-LABEL: f1:
	; CHECK: sqebr			; CHECK: sqebr
	; CHECK: {{aebr|sebr}}			; CHECK: ste
	; CHECK: sqebr			; CHECK: sqebr
				; CHECK: ste
	; CHECK: br %r14			; CHECK: br %r14

	%add = fadd float %f1, %f2			%sqrt1 = call float @llvm.sqrt.f32(float %f1)
	%sub = fsub float %f3, %f4			%sqrt2 = call float @llvm.sqrt.f32(float %f2)
	%sqrt1 = call float @llvm.sqrt.f32(float %f2)
	%sqrt2 = call float @llvm.sqrt.f32(float %f4)

	%ptr1 = getelementptr float, float *%ptr0, i64 1
	%ptr2 = getelementptr float, float *%ptr0, i64 2
	%ptr3 = getelementptr float, float *%ptr0, i64 3

	store float %add, float *%ptr0			store float %sqrt1, float *%ptr1
	store float %sub, float *%ptr1			store float %sqrt2, float *%ptr2
	store float %sqrt1, float *%ptr2
	store float %sqrt2, float *%ptr3

	ret void			ret void
	}			}

	; But for strict operations, this must not happen.			define void @f2(float %f1, float %f2, float *%ptr1, float *%ptr2) {
	define void @f2(float %f1, float %f2, float %f3, float %f4, float *%ptr0) {
	; CHECK-LABEL: f2:			; CHECK-LABEL: f2:
	; CHECK: {{aebr|sebr}}
	; CHECK: {{aebr|sebr}}
	; CHECK: sqebr			; CHECK: sqebr
				; CHECK: ste
	; CHECK: sqebr			; CHECK: sqebr
				; CHECK: ste
	; CHECK: br %r14			; CHECK: br %r14

	%add = call float @llvm.experimental.constrained.fadd.f32(			%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f1, float %f2,			float %f1,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.strict")			metadata !"fpexcept.ignore")
	%sub = call float @llvm.experimental.constrained.fsub.f32(			%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f3, float %f4,			float %f2,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.strict")			metadata !"fpexcept.ignore")

				store float %sqrt1, float *%ptr1
				store float %sqrt2, float *%ptr2

				ret void
				}

				define void @f3(float %f1, float %f2, float *%ptr1, float *%ptr2) {
				; CHECK-LABEL: f3:
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: br %r14

	%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(			%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f2,			float %f1,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.strict")			metadata !"fpexcept.strict")
	%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(			%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f4,			float %f2,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.strict")			metadata !"fpexcept.strict")

	%ptr1 = getelementptr float, float *%ptr0, i64 1			store float %sqrt1, float *%ptr1
	%ptr2 = getelementptr float, float *%ptr0, i64 2			store float %sqrt2, float *%ptr2
	%ptr3 = getelementptr float, float *%ptr0, i64 3

	store float %add, float *%ptr0			ret void
	store float %sub, float *%ptr1			}
	store float %sqrt1, float *%ptr2
	store float %sqrt2, float *%ptr3
				; We can move a non-strict FP operation or a fpexcept.ignore
				; operation even across a volatile store, but not a fpexcept.strict
				; operation.

				define void @f4(float %f1, float %f2, float *%ptr1, float *%ptr2) {
				; CHECK-LABEL: f4:
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: br %r14

				%sqrt1 = call float @llvm.sqrt.f32(float %f1)
				%sqrt2 = call float @llvm.sqrt.f32(float %f2)

				store volatile float %sqrt1, float *%ptr1
				store volatile float %sqrt2, float *%ptr2

	ret void			ret void
	}			}

	; On the other hand, strict operations that use the fpexcept.ignore			define void @f5(float %f1, float %f2, float *%ptr1, float *%ptr2) {
	; exception behaviour should be scheduled freely.			; CHECK-LABEL: f5:
	define void @f3(float %f1, float %f2, float %f3, float %f4, float *%ptr0) {
	; CHECK-LABEL: f3:
	; CHECK: sqebr			; CHECK: sqebr
	; CHECK: {{aebr|sebr}}			; CHECK: ste
	; CHECK: sqebr			; CHECK: sqebr
				; CHECK: ste
	; CHECK: br %r14			; CHECK: br %r14

	%add = call float @llvm.experimental.constrained.fadd.f32(			%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f1, float %f2,			float %f1,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.ignore")			metadata !"fpexcept.ignore")
	%sub = call float @llvm.experimental.constrained.fsub.f32(			%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f3, float %f4,			float %f2,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.ignore")			metadata !"fpexcept.ignore")

				store volatile float %sqrt1, float *%ptr1
				store volatile float %sqrt2, float *%ptr2

				ret void
				}

				define void @f6(float %f1, float %f2, float *%ptr1, float *%ptr2) {
				; CHECK-LABEL: f6:
				; CHECK: sqebr
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: ste
				; CHECK: br %r14

	%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(			%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
				float %f1,
				metadata !"round.dynamic",
				metadata !"fpexcept.strict")
				%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f2,			float %f2,
	metadata !"round.dynamic",			metadata !"round.dynamic",
				metadata !"fpexcept.strict")

				store volatile float %sqrt1, float *%ptr1
				store volatile float %sqrt2, float *%ptr2

				ret void
				}


				; No variant of FP operations can be scheduled across an SFPC.

				define void @f7(float %f1, float %f2, float *%ptr1, float *%ptr2) {
				; CHECK-LABEL: f7:
				; CHECK: sqebr
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: ste
				; CHECK: br %r14

				%sqrt1 = call float @llvm.sqrt.f32(float %f1)
				%sqrt2 = call float @llvm.sqrt.f32(float %f2)

				call void @llvm.s390.sfpc(i32 0)

				store float %sqrt1, float *%ptr1
				store float %sqrt2, float *%ptr2

				ret void
				}

				define void @f8(float %f1, float %f2, float *%ptr1, float *%ptr2) {
				; CHECK-LABEL: f8:
				; CHECK: sqebr
				; CHECK: sqebr
				; CHECK: ste
				; CHECK: ste
				; CHECK: br %r14

				%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
				float %f1,
				metadata !"round.dynamic",
	metadata !"fpexcept.ignore")			metadata !"fpexcept.ignore")
	%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(			%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
	float %f4,			float %f2,
	metadata !"round.dynamic",			metadata !"round.dynamic",
	metadata !"fpexcept.ignore")			metadata !"fpexcept.ignore")

	%ptr1 = getelementptr float, float *%ptr0, i64 1			call void @llvm.s390.sfpc(i32 0)
	%ptr2 = getelementptr float, float *%ptr0, i64 2
	%ptr3 = getelementptr float, float *%ptr0, i64 3

	store float %add, float *%ptr0			store float %sqrt1, float *%ptr1
	store float %sub, float *%ptr1			store float %sqrt2, float *%ptr2
	store float %sqrt1, float *%ptr2
	store float %sqrt2, float *%ptr3

	ret void			ret void
	}			}

	; However, even non-strict operations must not be scheduled across an SFPC.			define void @f9(float %f1, float %f2, float *%ptr1, float *%ptr2) {
	define void @f4(float %f1, float %f2, float %f3, float %f4, float *%ptr0) {			; CHECK-LABEL: f9:
	; CHECK-LABEL: f4:
	; CHECK: {{aebr|sebr}}
	; CHECK: {{aebr|sebr}}
	; CHECK: sfpc
	; CHECK: sqebr			; CHECK: sqebr
	; CHECK: sqebr			; CHECK: sqebr
				; CHECK: ste
				; CHECK: ste
	; CHECK: br %r14			; CHECK: br %r14

	%add = fadd float %f1, %f2			%sqrt1 = call float @llvm.experimental.constrained.sqrt.f32(
	%sub = fsub float %f3, %f4			float %f1,
				metadata !"round.dynamic",
				metadata !"fpexcept.strict")
				%sqrt2 = call float @llvm.experimental.constrained.sqrt.f32(
				float %f2,
				metadata !"round.dynamic",
				metadata !"fpexcept.strict")

	call void @llvm.s390.sfpc(i32 0)			call void @llvm.s390.sfpc(i32 0)
	%sqrt1 = call float @llvm.sqrt.f32(float %f2)
	%sqrt2 = call float @llvm.sqrt.f32(float %f4)

	%ptr1 = getelementptr float, float *%ptr0, i64 1			store float %sqrt1, float *%ptr1
	%ptr2 = getelementptr float, float *%ptr0, i64 2			store float %sqrt2, float *%ptr2
	%ptr3 = getelementptr float, float *%ptr0, i64 3

	store float %add, float *%ptr0
	store float %sub, float *%ptr1
	store float %sqrt1, float *%ptr2
	store float %sqrt2, float *%ptr3

	ret void			ret void
	}			}

test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll

; S390X-NEXT: ld %f0, 16(%r2)		; S390X-NEXT: ld %f0, 16(%r2)
; S390X-NEXT: ld %f1, 8(%r2)		; S390X-NEXT: ld %f1, 8(%r2)
; S390X-NEXT: larl %r1, .LCPI3_0		; S390X-NEXT: larl %r1, .LCPI3_0
; S390X-NEXT: ldeb %f2, 0(%r1)		; S390X-NEXT: ldeb %f2, 0(%r1)
; S390X-NEXT: larl %r1, .LCPI3_1		; S390X-NEXT: larl %r1, .LCPI3_1
; S390X-NEXT: ldeb %f3, 0(%r1)		; S390X-NEXT: ldeb %f3, 0(%r1)
; S390X-NEXT: larl %r1, .LCPI3_2		; S390X-NEXT: larl %r1, .LCPI3_2
; S390X-NEXT: ldeb %f4, 0(%r1)		; S390X-NEXT: ldeb %f4, 0(%r1)
; S390X-NEXT: ddb %f2, 0(%r2)
; S390X-NEXT: ddbr %f3, %f1		; S390X-NEXT: ddbr %f3, %f1
		; S390X-NEXT: ddb %f2, 0(%r2)
; S390X-NEXT: ddbr %f4, %f0		; S390X-NEXT: ddbr %f4, %f0
; S390X-NEXT: std %f4, 16(%r2)		; S390X-NEXT: std %f4, 16(%r2)
; S390X-NEXT: std %f3, 8(%r2)		; S390X-NEXT: std %f3, 8(%r2)
; S390X-NEXT: std %f2, 0(%r2)		; S390X-NEXT: std %f2, 0(%r2)
; S390X-NEXT: br %r14		; S390X-NEXT: br %r14
;		;
; SZ13-LABEL: constrained_vector_fdiv_v3f64:		; SZ13-LABEL: constrained_vector_fdiv_v3f64:
; SZ13: # %bb.0: # %entry		; SZ13: # %bb.0: # %entry
<...snip...>
%mul = call <3 x float> @llvm.experimental.constrained.fmul.v3f32(
metadata !"round.dynamic",		metadata !"round.dynamic",
metadata !"fpexcept.strict")		metadata !"fpexcept.strict")
ret <3 x float> %mul		ret <3 x float> %mul
}		}

define void @constrained_vector_fmul_v3f64(<3 x double>* %a) {		define void @constrained_vector_fmul_v3f64(<3 x double>* %a) {
; S390X-LABEL: constrained_vector_fmul_v3f64:		; S390X-LABEL: constrained_vector_fmul_v3f64:
; S390X: # %bb.0: # %entry		; S390X: # %bb.0: # %entry
		; S390X-NEXT: ld %f0, 8(%r2)
; S390X-NEXT: larl %r1, .LCPI13_0		; S390X-NEXT: larl %r1, .LCPI13_0
; S390X-NEXT: ld %f0, 0(%r1)		; S390X-NEXT: ld %f1, 0(%r1)
; S390X-NEXT: ld %f1, 8(%r2)
; S390X-NEXT: ld %f2, 16(%r2)		; S390X-NEXT: ld %f2, 16(%r2)
; S390X-NEXT: ldr %f3, %f0		; S390X-NEXT: mdbr %f0, %f1
		; S390X-NEXT: ldr %f3, %f1
; S390X-NEXT: mdb %f3, 0(%r2)		; S390X-NEXT: mdb %f3, 0(%r2)
; S390X-NEXT: mdbr %f1, %f0		; S390X-NEXT: mdbr %f2, %f1
; S390X-NEXT: mdbr %f2, %f0
; S390X-NEXT: std %f2, 16(%r2)		; S390X-NEXT: std %f2, 16(%r2)
; S390X-NEXT: std %f1, 8(%r2)		; S390X-NEXT: std %f0, 8(%r2)
; S390X-NEXT: std %f3, 0(%r2)		; S390X-NEXT: std %f3, 0(%r2)
; S390X-NEXT: br %r14		; S390X-NEXT: br %r14
;		;
; SZ13-LABEL: constrained_vector_fmul_v3f64:		; SZ13-LABEL: constrained_vector_fmul_v3f64:
; SZ13: # %bb.0: # %entry		; SZ13: # %bb.0: # %entry
; SZ13-NEXT: larl %r1, .LCPI13_0		; SZ13-NEXT: larl %r1, .LCPI13_0
; SZ13-NEXT: ld %f1, 0(%r1)		; SZ13-NEXT: ld %f1, 0(%r1)
; SZ13-NEXT: larl %r1, .LCPI13_1		; SZ13-NEXT: larl %r1, .LCPI13_1
<...snip...>
%add = call <3 x float> @llvm.experimental.constrained.fadd.v3f32(
metadata !"round.dynamic",		metadata !"round.dynamic",
metadata !"fpexcept.strict")		metadata !"fpexcept.strict")
ret <3 x float> %add		ret <3 x float> %add
}		}

define void @constrained_vector_fadd_v3f64(<3 x double>* %a) {		define void @constrained_vector_fadd_v3f64(<3 x double>* %a) {
; S390X-LABEL: constrained_vector_fadd_v3f64:		; S390X-LABEL: constrained_vector_fadd_v3f64:
; S390X: # %bb.0: # %entry		; S390X: # %bb.0: # %entry
		; S390X-NEXT: ld %f0, 8(%r2)
; S390X-NEXT: larl %r1, .LCPI18_0		; S390X-NEXT: larl %r1, .LCPI18_0
; S390X-NEXT: ld %f0, 0(%r1)		; S390X-NEXT: ld %f1, 0(%r1)
; S390X-NEXT: ld %f1, 8(%r2)
; S390X-NEXT: ld %f2, 16(%r2)		; S390X-NEXT: ld %f2, 16(%r2)
; S390X-NEXT: ldr %f3, %f0		; S390X-NEXT: adbr %f0, %f1
		; S390X-NEXT: ldr %f3, %f1
; S390X-NEXT: adb %f3, 0(%r2)		; S390X-NEXT: adb %f3, 0(%r2)
; S390X-NEXT: adbr %f1, %f0		; S390X-NEXT: adbr %f2, %f1
; S390X-NEXT: adbr %f2, %f0
; S390X-NEXT: std %f2, 16(%r2)		; S390X-NEXT: std %f2, 16(%r2)
; S390X-NEXT: std %f1, 8(%r2)		; S390X-NEXT: std %f0, 8(%r2)
; S390X-NEXT: std %f3, 0(%r2)		; S390X-NEXT: std %f3, 0(%r2)
; S390X-NEXT: br %r14		; S390X-NEXT: br %r14
;		;
; SZ13-LABEL: constrained_vector_fadd_v3f64:		; SZ13-LABEL: constrained_vector_fadd_v3f64:
; SZ13: # %bb.0: # %entry		; SZ13: # %bb.0: # %entry
; SZ13-NEXT: larl %r1, .LCPI18_0		; SZ13-NEXT: larl %r1, .LCPI18_0
; SZ13-NEXT: ld %f1, 0(%r1)		; SZ13-NEXT: ld %f1, 0(%r1)
; SZ13-NEXT: larl %r1, .LCPI18_1		; SZ13-NEXT: larl %r1, .LCPI18_1
<...snip...>
entry:
ret <2 x double> %sub		ret <2 x double> %sub
}		}

define <3 x float> @constrained_vector_fsub_v3f32() {		define <3 x float> @constrained_vector_fsub_v3f32() {
; S390X-LABEL: constrained_vector_fsub_v3f32:		; S390X-LABEL: constrained_vector_fsub_v3f32:
; S390X: # %bb.0: # %entry		; S390X: # %bb.0: # %entry
; S390X-NEXT: larl %r1, .LCPI22_0		; S390X-NEXT: larl %r1, .LCPI22_0
; S390X-NEXT: le %f0, 0(%r1)		; S390X-NEXT: le %f0, 0(%r1)
; S390X-NEXT: lzer %f1
; S390X-NEXT: ler %f4, %f0		; S390X-NEXT: ler %f4, %f0
; S390X-NEXT: sebr %f4, %f1
; S390X-NEXT: larl %r1, .LCPI22_1		; S390X-NEXT: larl %r1, .LCPI22_1
; S390X-NEXT: ler %f2, %f0		; S390X-NEXT: ler %f2, %f0
; S390X-NEXT: seb %f2, 0(%r1)		; S390X-NEXT: seb %f2, 0(%r1)
; S390X-NEXT: larl %r1, .LCPI22_2		; S390X-NEXT: larl %r1, .LCPI22_2
; S390X-NEXT: seb %f0, 0(%r1)		; S390X-NEXT: seb %f0, 0(%r1)
		; S390X-NEXT: lzer %f1
		; S390X-NEXT: sebr %f4, %f1
; S390X-NEXT: br %r14		; S390X-NEXT: br %r14
;		;
; SZ13-LABEL: constrained_vector_fsub_v3f32:		; SZ13-LABEL: constrained_vector_fsub_v3f32:
; SZ13: # %bb.0: # %entry		; SZ13: # %bb.0: # %entry
; SZ13-NEXT: vgbm %v2, 15		; SZ13-NEXT: vgbm %v2, 15
; SZ13-NEXT: lzer %f1		; SZ13-NEXT: lzer %f1
; SZ13-NEXT: sebr %f2, %f1		; SZ13-NEXT: sebr %f2, %f1
; SZ13-NEXT: vgmf %v1, 1, 1		; SZ13-NEXT: vgmf %v1, 1, 1