This is an archive of the discontinued LLVM Phabricator instance.

transform fadd chains to increase parallelism
ClosedPublic

Authored by spatel on Apr 23 2015, 1:50 PM.

Details

Reviewers

qcolombet
andreadb
hfinkel

Commits

rG2fbc4e5c4942: transform fadd chains to increase parallelism
rL236031: transform fadd chains to increase parallelism

Summary

This is a compromise: with this simple patch, we should always handle a chain of exactly 3 operations optimally, but we're not generating the optimal balanced binary tree for a longer sequence.

In general, this transform will reduce the dependency chain for a sequence of instructions using N operands from a worst case N-1 dependent operations to N/2 dependent operations. The optimal balanced binary tree would reduce the chain to log2(N).
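To make the three chain lengths above concrete, here is a small sketch that computes the number of dependent operations on the critical path for each shape. The function names are hypothetical, and the N/2 figure is taken directly from the summary rather than derived; how it rounds for odd N is an assumption.

```cpp
#include <cassert>

// Worst case: every add depends on the previous one, so N operands
// need N - 1 dependent operations.
unsigned serialChainDepth(unsigned N) { return N > 0 ? N - 1 : 0; }

// Estimate quoted in the summary for this patch: N / 2 dependent operations.
// Integer division for odd N is an assumption of this sketch.
unsigned summaryEstimateDepth(unsigned N) { return N / 2; }

// Optimal balanced binary tree: ceil(log2(N)) dependent operations.
unsigned balancedTreeDepth(unsigned N) {
  unsigned Depth = 0;
  for (unsigned Width = 1; Width < N; Width *= 2)
    ++Depth;
  return Depth;
}
```

For N = 4 operands (a chain of exactly 3 adds) the three functions give 3, 2, and 2, which is why the patch is optimal for that case. For N = 8 they give 7, 4, and 3, so the gap to the balanced tree only opens up on longer chains.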

As I see it, the trade-off for not dealing with longer sequences is: (1) we have less complexity in the compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we don't need to worry about the increased register pressure required to parallelize longer sequences. It also seems unlikely that we would ever encounter really long strings of dependent ops like that in the wild, but I'm not sure how to verify that speculation. FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math and this patch.
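The -ffast-math requirement mentioned above matters because floating-point addition is not associative, so regrouping the adds can change the rounded result. The sketch below is only an illustration with hypothetical function names; it evaluates the original grouping and the reassociated grouping of four floats, using volatile temporaries so that each intermediate sum is rounded to float.

```cpp
#include <cassert>

// Original shape: ((a + b) + c) + d, three dependent adds.
float serialSum(float A, float B, float C, float D) {
  volatile float T0 = A + B;
  volatile float T1 = T0 + C;
  return T1 + D;
}

// Reassociated shape: (a + b) + (d + c), two independent adds and one
// dependent add.
float reassociatedSum(float A, float B, float C, float D) {
  volatile float T0 = A + B;
  volatile float T1 = D + C;
  return T0 + T1;
}
```

With the inputs 1e8f, 0.0f, -1e8f, 1.0f, the serial form cancels the large terms first and returns 1.0f, while the reassociated form first rounds 1.0f + -1e8f to -1e8f and returns 0.0f. With small, exactly representable inputs such as 1, 2, 3, 4, both forms agree.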

If this patch looks ok, then I can extend it to cover other associative operations such as fmul, fmax, fmin, integer add, integer mul.

This is a partial fix for:
https://llvm.org/bugs/show_bug.cgi?id=17305

and if extended:
https://llvm.org/bugs/show_bug.cgi?id=21768
https://llvm.org/bugs/show_bug.cgi?id=23116

The issue also came up in:
http://reviews.llvm.org/D8941

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 24326. (Apr 23 2015, 1:50 PM)

spatel retitled this revision from to transform fadd chains to increase parallelism.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: hfinkel, qcolombet, andreadb.

spatel added a subscriber: Unknown Object (MLST).

qcolombet added inline comments. (Apr 28 2015, 11:15 AM)

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7662 ↗	(On Diff #24326)	I would prefer the comment to match the actual code, i.e., invert the order of the operands:
(fadd (fadd (fadd z, w), y), x) -> (fadd (fadd z, w), (fadd x, y))
You could even use named operands like this:
(fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) -> (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
7666 ↗	(On Diff #24326)	You can move this assignment into the next if.
test/CodeGen/X86/fp-fast.ll
124 ↗	(On Diff #24326)	Can’t you be more specific on the input registers? With a pattern like this, I believe even the old inefficient sequence would match, wouldn’t it?

spatel added inline comments. (Apr 28 2015, 11:27 AM)

test/CodeGen/X86/fp-fast.ll
124 ↗	(On Diff #24326)	Hi Quentin, Thanks for reviewing this patch. I don't think we can be more specific on the inputs: we know that xmm0 - xmm3 are the input registers, but the order of the operands as well as the order of the first two adds may be commuted (not by this patch, but some future patch)? I made sure that the last check will not work without this patch. It requires that the outputs of the first two adds are inputs to the third add. This final add check is actually too specific because it fixes the order of the operands. I tried every regex combo that I could think of to make that more flexible, but couldn't get anything to work with FileCheck.

qcolombet added inline comments. (Apr 28 2015, 11:39 AM)

test/CodeGen/X86/fp-fast.ll
124 ↗	(On Diff #24326)	Let’s stick to the current order of the operands. If they change, we can fix them. Anyhow, this is not a big deal; I was just not sure it wouldn’t match the old sequence :).

spatel added inline comments. (Apr 28 2015, 1:29 PM)

test/CodeGen/X86/fp-fast.ll
124 ↗	(On Diff #24326)	Ok - if we're not too concerned with flexibility of the match, it becomes easy. :) I can just use update_llc_test_checks.py for the exact match.

Patch updated:

Fixed fold comment to match code
Moved variable declaration closer to use
Made test CHECK lines match expected output exactly

Thanks Sanjay!

LGTM.

This revision is now accepted and ready to land. (Apr 28 2015, 1:49 PM)

Closed by commit rL236031: transform fadd chains to increase parallelism (authored by spatel). (Apr 28 2015, 2:06 PM)

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D9780: expose ILP for associative operations in the DAG. (May 14 2015, 12:18 PM)

Revision Contents

Path                                                   Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp    18 lines
llvm/trunk/test/CodeGen/X86/fp-fast.ll                 43 lines

Diff 24577

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

(earlier lines not shown; context: if (TLI.isOperationLegalOrCustom(ISD::FMUL, VT) && !N0CFP && !N1CFP) {)
           N0.getOperand(0) == N0.getOperand(1) &&
           N1.getOperand(0) == N1.getOperand(1) &&
           N0.getOperand(0) == N1.getOperand(0)) {
         SDLoc DL(N);
         return DAG.getNode(ISD::FMUL, DL, VT,
                            N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
       }
     }

+    // Canonicalize chains of adds to LHS to simplify the following transform.
+    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
+      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
+
+    // Convert a chain of 3 dependent operations into 2 independent operations
+    // and 1 dependent operation:
+    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
+    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
+    if (N0.getOpcode() == ISD::FADD && N0.hasOneUse() &&
+        N1.getOpcode() != ISD::FADD) {
+      SDValue N00 = N0.getOperand(0);
+      if (N00.getOpcode() == ISD::FADD) {
+        SDValue N01 = N0.getOperand(1);
+        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
+        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
+      }
+    }
   } // enable-unsafe-fp-math

   // FADD -> FMA combines:
   SDValue Fused = visitFADDForFMACombine(N);
   if (Fused) {
     AddToWorklist(Fused.getNode());
     return Fused;
   }
(later lines not shown)
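The two folds in this hunk can be modeled outside of LLVM on a toy expression tree. The sketch below is an illustration only: `Expr`, `leaf`, `fadd`, `depth`, `combine`, and `print` are hypothetical names, not SelectionDAG APIs. It also makes two simplifying assumptions: the `N0.hasOneUse()` check is omitted because every node in a tree has exactly one use, and the canonicalizing swap falls through to the reassociation check instead of returning and waiting for the node to be revisited.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy node: either a named leaf operand or an fadd of two subtrees.
struct Expr {
  std::string Name;               // set for leaves only
  std::shared_ptr<Expr> LHS, RHS; // set for fadd nodes only
  bool isAdd() const { return LHS != nullptr; }
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr leaf(std::string Name) {
  auto E = std::make_shared<Expr>();
  E->Name = std::move(Name);
  return E;
}

ExprPtr fadd(ExprPtr L, ExprPtr R) {
  auto E = std::make_shared<Expr>();
  E->LHS = std::move(L);
  E->RHS = std::move(R);
  return E;
}

// Number of dependent adds on the longest leaf-to-root path.
int depth(const ExprPtr &E) {
  return E->isAdd() ? 1 + std::max(depth(E->LHS), depth(E->RHS)) : 0;
}

// Apply both folds bottom-up, once per node.
ExprPtr combine(const ExprPtr &E) {
  if (!E->isAdd())
    return E;
  ExprPtr N0 = combine(E->LHS);
  ExprPtr N1 = combine(E->RHS);

  // Canonicalize: if only the right operand is an fadd, move it to the left.
  if (!N0->isAdd() && N1->isAdd())
    std::swap(N0, N1);

  // (fadd (fadd (fadd z, w), y), x) -> (fadd (fadd z, w), (fadd x, y))
  if (N0->isAdd() && !N1->isAdd() && N0->LHS->isAdd())
    return fadd(N0->LHS, fadd(N1, N0->RHS));

  return fadd(N0, N1);
}

// Render the tree with explicit parentheses, e.g. "((a + b) + (d + c))".
std::string print(const ExprPtr &E) {
  if (!E->isAdd())
    return E->Name;
  return "(" + print(E->LHS) + " + " + print(E->RHS) + ")";
}
```

In this model, the serial chain `fadd(fadd(fadd(a, b), c), d)` has depth 3 and `combine` rewrites it to `((a + b) + (d + c))` with depth 2. The commuted inputs `(c + (a + b)) + d` and `d + ((a + b) + c)` reach the same shape through the canonicalization step.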

llvm/trunk/test/CodeGen/X86/fp-fast.ll

(earlier lines not shown)
 ; CHECK-LABEL: test11:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT: retq
   %t1 = fsub float -0.0, %a
   %t2 = fadd float %a, %t1
   ret float %t2
 }

+; Verify that the first two adds are independent; the destination registers
+; are used as source registers for the third add.
+
+define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds1:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %add0, %c
+  %add2 = fadd float %add1, %d
+  ret float %add2
+}
+
+define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds2:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %c, %add0
+  %add2 = fadd float %add1, %d
+  ret float %add2
+}
+
+define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds3:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %add0, %c
+  %add2 = fadd float %d, %add1
+  ret float %add2
+}