This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAGBuilder] Use accumulator value in VECREDUCE_FADD/FMUL
Abandoned, Public

Authored by sdesmalen on Mar 14 2019, 4:21 AM.

Details

Reviewers

aemerson
RKSimon
t.p.northover
nikic
spatel
efriedma

Summary

The accumulator value always seems to be ignored when creating
non-strict VECREDUCE_FADD/FMUL ISD nodes. This patch fixes that.
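
To make the intended change concrete, here is a minimal standalone C++ sketch (not LLVM code) that models the value a `fast` fadd reduction produces before and after this patch. The names `Accumulator`, `reduceFast`, `fastFaddBefore` and `fastFaddAfter` are hypothetical and exist only for illustration; the `isUndef` flag is an assumed stand-in for an `undef` first operand.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for the intrinsic's first operand, which may be undef.
struct Accumulator {
  bool isUndef;
  double value;
};

// Models the result of a VECREDUCE_FADD node: a plain sum of the elements.
double reduceFast(const std::vector<double> &v) {
  return std::accumulate(v.begin(), v.end(), 0.0);
}

// Before the patch: the accumulator is dropped for fast reductions.
double fastFaddBefore(const Accumulator & /*acc*/,
                      const std::vector<double> &v) {
  return reduceFast(v);
}

// After the patch: the reduction result is added to the accumulator,
// unless the accumulator is undef.
double fastFaddAfter(const Accumulator &acc, const std::vector<double> &v) {
  double res = reduceFast(v);
  if (!acc.isUndef)
    res = acc.value + res;
  return res;
}
```

With an accumulator of 42.0 and the elements 1, 2, 3, 4, the "before" model yields 10.0 and the "after" model yields 52.0; with an undef accumulator both models agree.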

Diff Detail

Event Timeline

sdesmalen created this revision. Mar 14 2019, 4:21 AM

Herald added a subscriber: javed.absar. Mar 14 2019, 4:21 AM

sdesmalen added a parent revision: D59259: [AArch64] Use faddp to implement fadd reductions. Mar 14 2019, 4:21 AM

See also https://bugs.llvm.org/show_bug.cgi?id=36734 and https://reviews.llvm.org/D45336. According to @aemerson this is intended behavior.

Imho we should definitely make this change, and consider also making it for undef -- it seems like a bad idea to special-case it, because it may negatively affect optimization.

If there are concerns about backwards compatibility for experimental intrinsics, we should rename and autoupgrade them. (Upgrade fadd to fadd.acc, either with original argument or 0.0 argument, depending on whether it is fast).
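
As a rough illustration of the upgrade rule suggested above, the following standalone C++ sketch models how the accumulator could be chosen when rewriting an old call. It is not the LLVM AutoUpgrade API: the structs `OldReduceCall` and `UpgradedReduceCall` and the function `upgradeFadd` are hypothetical, and only the name `fadd.acc` and the 0.0 replacement value come from the comment.

```cpp
#include <string>

// Hypothetical model of a call to the old experimental fadd reduction.
struct OldReduceCall {
  bool isFast;        // whether the call carries the `fast` flag
  double accumulator; // first operand; the old fast lowering ignored it
};

// Hypothetical model of the upgraded call.
struct UpgradedReduceCall {
  std::string name;
  bool isFast;
  double accumulator;
};

// Old fast calls never used their accumulator, so the upgrade substitutes
// 0.0; non-fast calls keep the original argument.
UpgradedReduceCall upgradeFadd(const OldReduceCall &call) {
  double acc = call.isFast ? 0.0 : call.accumulator;
  return {"fadd.acc", call.isFast, acc};
}
```

Under this model, an old `fast` call whose accumulator was `undef` or any other value keeps its old result after the upgrade, because the value it never used is replaced by 0.0.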

@nikic thanks for pointing me to that discussion! I clearly misread the LangRef for this change :)
I agree it makes more sense to change these intrinsics so that their accumulator argument is always relevant (regardless of what flags are set) and to AutoUpgrade older IR. Given that I'm currently working on this, I'd be happy to move this forward with patches and a proposal/discussion on the mailing list to change the experimental reduction intrinsics. @aemerson you expressed an intention to work on it later this year, do you have any objection to me moving forward with this now?

If you're going down the auto-upgrade route then I suggest proposing that we promote these from experimental to first-class intrinsics. That way you can auto-upgrade from one intrinsic to another without any risk of breaking older code (i.e. you can't just start using an accumulator arg that before could be unused and therefore undef).

I'm not sure I agree - we shouldn't drop the experimental state unless we're certain the intrinsic is not going to need to be further tweaked in the future. I still think not including an accumulator argument at all would be for the best.

There needs to be some way to determine when it's safe to upgrade the IR, either through a name change or some other mechanism. The old implementations which pass undef/whatever as an accumulator can't start being compiled to code making use of it.

If we don't include an accumulator, then we have to split the intrinsics into strictly ordered and non-strictly ordered. But that's what the fast math flag is for on the call, so the flag becomes useless.
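
The distinction between the strictly ordered and the non-strict form can be illustrated with a small standalone C++ sketch. It is not LLVM code: `orderedFadd` and `reassociatedFadd` are hypothetical names, and the halving scheme is just one assumed way a reassociated reduction might be evaluated.

```cpp
#include <cstddef>
#include <vector>

// Strictly ordered reduction: the accumulator is threaded through the chain,
// ((((acc + v0) + v1) + v2) + v3).
double orderedFadd(double acc, const std::vector<double> &v) {
  for (double x : v)
    acc = acc + x;
  return acc;
}

// Reassociated reduction: repeatedly add the upper half of the vector onto
// the lower half, then add the accumulator once at the end.
// Assumes the element count is a power of two.
double reassociatedFadd(double acc, std::vector<double> v) {
  if (v.empty())
    return acc;
  for (std::size_t half = v.size() / 2; half >= 1; half /= 2)
    for (std::size_t i = 0; i < half; ++i)
      v[i] = v[i] + v[i + half];
  return acc + v[0];
}
```

For the elements 1e100, 1.0, -1e100, 1.0 and an accumulator of 0.0, the ordered form returns 1.0 (the first 1.0 is absorbed by 1e100), while the reassociated form returns 2.0. Only the ordered form needs the accumulator to take part in the chain; in the reassociated form it can be added separately.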

This one can probably be abandoned in favor of D60261?

Superseded by D60261.

Revision Contents

Path                                                Size
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp    12 lines
test/CodeGen/AArch64/vecreduce-fadd.ll              27 lines

Diff 190599

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

  void SelectionDAGBuilder::visitVectorReduce(const CallInst &I,
    EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
    SDValue Res;
    FastMathFlags FMF;
    if (isa<FPMathOperator>(I))
      FMF = I.getFastMathFlags();

    switch (Intrinsic) {
    case Intrinsic::experimental_vector_reduce_fadd:
-     if (FMF.isFast())
+     if (FMF.isFast()) {
        Res = DAG.getNode(ISD::VECREDUCE_FADD, dl, VT, Op2);
-     else
+       if (!Op1.isUndef())
+         Res = DAG.getNode(ISD::FADD, dl, VT, Op1, Res);
+     } else
        Res = DAG.getNode(ISD::VECREDUCE_STRICT_FADD, dl, VT, Op1, Op2);
      break;
    case Intrinsic::experimental_vector_reduce_fmul:
-     if (FMF.isFast())
+     if (FMF.isFast()) {
        Res = DAG.getNode(ISD::VECREDUCE_FMUL, dl, VT, Op2);
-     else
+       if (!Op1.isUndef())
+         Res = DAG.getNode(ISD::FMUL, dl, VT, Op1, Res);
+     } else
        Res = DAG.getNode(ISD::VECREDUCE_STRICT_FMUL, dl, VT, Op1, Op2);
      break;
    case Intrinsic::experimental_vector_reduce_add:
      Res = DAG.getNode(ISD::VECREDUCE_ADD, dl, VT, Op1);
      break;
    case Intrinsic::experimental_vector_reduce_mul:
      Res = DAG.getNode(ISD::VECREDUCE_MUL, dl, VT, Op1);
      break;

test/CodeGen/AArch64/vecreduce-fadd.ll

  ; CHECK-LABEL: add_2D:
  ; CHECK: fadd v0.2d, v0.2d, v1.2d
  ; CHECK-NEXT: faddp d0, v0.2d
  ; CHECK-NEXT: ret
    %r = call fast double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double undef, <4 x double> %bin.rdx)
    ret double %r
  }

+ define half @add_H_init42(<8 x half> %bin.rdx) {
+ ; CHECK-LABEL: add_H_init42:
+ ; CHECK: faddp h0, v0.2h
+ ; CHECK: fadd h0
+ ; CHECK-NEXT: ret
+   %r = call fast half @llvm.experimental.vector.reduce.fadd.f16.v8f16(half 42.0, <8 x half> %bin.rdx)
+   ret half %r
+ }
+
+ define float @add_S_init42(<4 x float> %bin.rdx) {
+ ; CHECK-LABEL: add_S_init42:
+ ; CHECK: faddp s0, v0.2s
+ ; CHECK: fadd s0
+ ; CHECK-NEXT: ret
+   %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float 42.0, <4 x float> %bin.rdx)
+   ret float %r
+ }
+
+ define double @add_D_init42(<2 x double> %bin.rdx) {
+ ; CHECK-LABEL: add_D_init42:
+ ; CHECK: faddp d0, v0.2d
+ ; CHECK: fadd d0
+ ; CHECK-NEXT: ret
+   %r = call fast double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double 42.0, <2 x double> %bin.rdx)
+   ret double %r
+ }
+
  ; Function Attrs: nounwind readnone
  declare half @llvm.experimental.vector.reduce.fadd.f16.v4f16(half, <4 x half>)
  declare half @llvm.experimental.vector.reduce.fadd.f16.v8f16(half, <8 x half>)
  declare half @llvm.experimental.vector.reduce.fadd.f16.v16f16(half, <16 x half>)
  declare float @llvm.experimental.vector.reduce.fadd.f32.v2f32(float, <2 x float>)
  declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)
  declare float @llvm.experimental.vector.reduce.fadd.f32.v8f32(float, <8 x float>)
  declare double @llvm.experimental.vector.reduce.fadd.f64.v2f64(double, <2 x double>)
  declare double @llvm.experimental.vector.reduce.fadd.f64.v4f64(double, <4 x double>)