This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Do several rounds of combine for addcarry nodes.
Needs Review · Public

Authored by deadalnix on Jul 3 2019, 4:57 PM.


Details

Reviewers

hfinkel
baldrick
efriedma
RKSimon
craig.topper
niravd
davezarzycki
arsenm

Summary

addcarry can explore several levels of the DAG to compute its result. This means that a transformation in the DAG that isn't immediately related to a node might affect it.

We therefore add these nodes to the DeepPatternNodes set so that they can be processed again in case the DAG is modified.

In the future, other nodes may be added to the set as the need arises.

This is a variation on D57367 that specializes in the specific case of large integer optimization.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 35434
Build 35433: arc lint + arc unit

Event Timeline

deadalnix created this revision. Jul 3 2019, 4:57 PM

Herald added a project: Restricted Project. Jul 3 2019, 4:57 PM

Harbormaster completed remote builds in B34336: Diff 207927. Jul 3 2019, 4:58 PM

deadalnix mentioned this in D57317: [DAGCombine] Deduplicate addcarry node using commutativity. Jul 3 2019, 5:00 PM

deadalnix added a child revision: D57317: [DAGCombine] Deduplicate addcarry node using commutativity. Jul 3 2019, 5:03 PM

craig.topper added inline comments. Jul 8 2019, 9:43 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
155	tomatch-> to match

craig.topper added inline comments. Jul 8 2019, 9:55 PM

test/CodeGen/X86/addcarry.ll
326	Doesn't this add and adc compute the same result as line 321 and 325?

deadalnix marked an inline comment as done. Jul 11 2019, 4:35 AM

deadalnix added inline comments.

test/CodeGen/X86/addcarry.ll
326	There is a lot of duplication that is generated by this. But once the carry propagation is "linearized" because you removed all the diamonds, a simple set of optimizations can get rid of it all. See D57317 for that specific case.

tomatch => to match

Harbormaster completed remote builds in B34758: Diff 209187. Jul 11 2019, 5:47 AM

Rebase now that D59208 landed on master.

subcarry related test cases now get much better codegen.

Harbormaster completed remote builds in B35095: Diff 210109. Jul 16 2019, 8:38 AM

RKSimon added inline comments. Jul 16 2019, 9:51 AM

test/CodeGen/X86/addcarry.ll
326	Would adding X86ISD::ADD to X86TargetLowering::isCommutativeBinOp help with this?

craig.topper added inline comments. Jul 16 2019, 9:55 AM

test/CodeGen/X86/addcarry.ll
326	If I remember right from what I saw in the DAG, we need to CSE ISD::ADDCARRY with commuted operands in DAG combine. Not sure about the X86ISD node.

RKSimon added inline comments. Jul 16 2019, 10:00 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
1217	superfluous change - revert ?

RKSimon added inline comments. Jul 16 2019, 10:02 AM

test/CodeGen/X86/addcarry.ll
326	By the looks of this, DAGCombiner::combine should handle it - but is disabled for commutative binop nodes with more than 1 output value.....

deadalnix marked an inline comment as done. Jul 20 2019, 6:36 PM

deadalnix added inline comments.

test/CodeGen/X86/addcarry.ll
326	@RKSimon It doesn't help, because ADDCARRY is not a binary op. It has 3 inputs, 2 of which are commutative. However, D57317 is ready to go and addresses that exact problem. I did not land D57317 because, as far as I know, this doesn't happen in the wild without this patch.
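
The operand shape described in this comment (three inputs, of which only the first two commute) can be illustrated with a small sketch. This is not the code from D57317: `AddCarryKey` and `makeAddCarryKey` are hypothetical names, and nodes are stood in for by plain integer ids. The idea is simply to order the two addends before building a lookup key, so that both operand orders map to the same entry while the carry-in stays in its own slot.

```cpp
#include <tuple>
#include <utility>

// Hypothetical lookup key for an addcarry-like node: (lhs, rhs, carry-in).
// Integer ids stand in for real DAG nodes in this sketch.
using AddCarryKey = std::tuple<int, int, int>;

// Build a key in which the two addends are sorted, because they commute.
// The carry-in is a distinct operand and is never swapped with an addend.
AddCarryKey makeAddCarryKey(int Lhs, int Rhs, int CarryIn) {
  if (Rhs < Lhs)
    std::swap(Lhs, Rhs);
  return std::make_tuple(Lhs, Rhs, CarryIn);
}
```

With such a key, `makeAddCarryKey(7, 3, 1)` and `makeAddCarryKey(3, 7, 1)` compare equal, whereas moving the carry-in into an addend position produces a different key.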

Revert the addition of an unnecessary empty line.

Harbormaster completed remote builds in B35434: Diff 210981. Jul 20 2019, 7:12 PM

Rebase and ping

Harbormaster completed remote builds in B36581: Diff 214577. Aug 11 2019, 11:07 PM

IMHO this is very unusual and papers over some other problem.

In D64174#1624738, @lebedev.ri wrote:

IMHO this is very unusual and papers over some other problem.

The problem it papers over is that the patterns we are interested in are deep. The same technique benefits any deep pattern, such as anything using simplifydemandedbits, but there was concern that doing it in all cases would hurt performance. The reason you end up with deep patterns in that case is that you find yourself facing things such as:

(addcarry (uaddo a, b), 0, c), and now you have two carries that propagate and the rest of the DAG is a total mess. This benefits cryptographic computation - and I'd expect many large integer computations in general - substantially.
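
To make the two-carry shape concrete, here is a minimal sketch on plain 64-bit integers. It only models the arithmetic of the uaddo and addcarry nodes; `SumCarry`, `uaddo`, `addcarry` and `addViaTwoCarries` are names made up for this illustration and are not SelectionDAG APIs. The nested form produces two carry bits that must be merged, while a single addcarry yields the same sum and carry directly.

```cpp
#include <cstdint>

using std::uint64_t;

// Sum plus carry-out, mirroring the two results of a uaddo/addcarry node.
struct SumCarry {
  uint64_t Sum;
  bool Carry;
};

// Models uaddo: unsigned add that also reports overflow.
SumCarry uaddo(uint64_t A, uint64_t B) {
  uint64_t S = A + B; // wraps modulo 2^64
  return {S, S < A};
}

// Models addcarry: A + B + CarryIn with a single carry-out.
SumCarry addcarry(uint64_t A, uint64_t B, bool CarryIn) {
  uint64_t S = A + B + (CarryIn ? 1 : 0);
  bool Carry = CarryIn ? S <= A : S < A;
  return {S, Carry};
}

// The nested form (addcarry (uaddo A, B), 0, CarryIn): two carry bits are
// produced and have to be merged. At most one of them can be set.
SumCarry addViaTwoCarries(uint64_t A, uint64_t B, bool CarryIn) {
  SumCarry First = uaddo(A, B);
  SumCarry Second = addcarry(First.Sum, 0, CarryIn);
  return {Second.Sum, First.Carry || Second.Carry};
}
```

In this model `addViaTwoCarries(A, B, C)` and `addcarry(A, B, C)` agree on both results, which is the kind of equivalence a combine would need to discover by looking several nodes deep.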

The approach isn't that unusual, as it is similar to what InstCombine does, only more focused on specific nodes that especially benefit, to ensure performance stays good in cases where this isn't required.

While InstCombine does iterate over the entire graph multiple times, it rarely does it more than twice, and the second time is just to detect that there are no changes. InstCombine largely avoids issues like this because it visits the instructions mostly from the beginning of the basic block down. This means that as new instructions are formed, their users likely haven’t even been visited yet.

DAGCombine, on the other hand, visits nodes from the end of the basic block up for the first round, so nodes early in the block haven’t been simplified yet. This results in needing to match long patterns or multiple variations, since we can’t rely on canonicalization other than what was done in IR. After the first round, the order is basically to visit any new nodes created by the previous legalize first, because they are at the end of the allnodes list, which isn’t re-sorted during legalize or before DAGCombine (not sure about the order here). Then the nodes that did not get modified are visited. Not sure about the order, but I think it’s from the end of the basic block again.

So the addcarry nodes later in the DAG get formed first and need to be revisited when earlier ones are created to get additional optimizations. If the earlier ones were formed first by going from beginning to end, we could probably get the optimizations in one pass. That would require changing how DAG combine builds its worklist and doing a topological sort before each DAGCombine run. This will likely require adding new pattern matches and maybe modifying existing ones. Not sure how much this will be.

I also wonder if there’s some better IR representation we should be using. Should we have an addcarry intrinsic that InstCombine could form? Is global isel better designed to handle this than DAGCombine? Would having a better IR representation help both?

In D64174#1634056, @craig.topper wrote:

I also wonder if there’s some better IR representation we should be using. Should we have an addcarry intrinsic that InstCombine could form? Is global isel better designed to handle this than DAGCombine? Would having a better IR representation help both?

This doesn't really work here. These nodes are often created during legalization, which happens at the SelectionDAG level, not at the IR level. I toyed with the idea of introducing an early legalize pass at the IR level, but it did not seem to get traction.

It is indeed true that InstCombine tends to do better due to nodes being processed in order. DAGCombiner has several problems here, the main one being that the order in which nodes are traversed is not very predictable, as it depends on the order in which the nodes were created and added to the DAG. After legalization, you typically end up visiting them in an order that is 100% implementation defined and isn't strictly top-down or bottom-up. Changing this leads very quickly to a world of pain, because there are a ton of cases where DAGCombiner transforms A into B if it sees A first, and transforms B into A if it sees B first. An example of this is (fadd NaN, undef) and (fadd undef, NaN), which transform into each other.

Topological sorting tends to be an expensive operation, it can easily go into n^2 territory, so I'm not convinced this is a win, because any insertion into the DAG would break it, as there is no notion of insertion at a given position. Do you see a way to create/maintain a worklist that is ordered as expected?

Changing this leads very quickly to a world of pain

Probably, yes. :)

Topological sorting tends to be an expensive operation, it can easily go into n^2 territory

Sorting once should be linear in the number of edges, and SelectionDAG DAGs generally have few edges. And it's not that expensive in practice; we do it twice already as part of normal compilation. (See AssignTopologicalOrder.)
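
As a rough illustration of why a one-off sort is linear, here is a sketch of Kahn's algorithm over an adjacency list. It is not AssignTopologicalOrder itself; the function name `topologicalOrder` and the representation (node ids as indices, `Users[N]` listing the nodes that use `N`) are assumptions made for this example. Each node is pushed and popped once and each edge is examined twice, so the cost is proportional to nodes plus edges.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Kahn's algorithm. Users[N] lists the nodes that consume node N, so every
// edge runs from an operand to its user. Returns operands before users.
std::vector<int> topologicalOrder(const std::vector<std::vector<int>> &Users) {
  // Count, for each node, how many operands have not been emitted yet.
  std::vector<int> Pending(Users.size(), 0);
  for (const std::vector<int> &Us : Users)
    for (int U : Us)
      ++Pending[U];

  // Nodes without operands are ready immediately.
  std::queue<int> Ready;
  for (std::size_t N = 0; N < Users.size(); ++N)
    if (Pending[N] == 0)
      Ready.push(static_cast<int>(N));

  std::vector<int> Order;
  Order.reserve(Users.size());
  while (!Ready.empty()) {
    int N = Ready.front();
    Ready.pop();
    Order.push_back(N);
    // Emitting N satisfies one operand of each of its users.
    for (int U : Users[N])
      if (--Pending[U] == 0)
        Ready.push(U);
  }
  return Order;
}
```

For a diamond where node 0 feeds nodes 1 and 2, which both feed node 3, the function visits the source first and the sink last.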

Do you see a way to create/maintain a worklist that is ordered as expected?

isel itself is an algorithm which incrementally transforms a SelectionDAG DAG and maintains topological ordering as it works. It visits the wrong direction for what we want, but it should be possible to reverse. Granted, this is maybe not the best approach; most DAGCombines take a node and produce one or more new nodes, just like isel, but I'm not sure how you'd handle the ones that don't.

Or you could probably reproduce something similar without actually maintaining the full topological sort on the DAG itself. Topological sort at the beginning, to build an initial worklist. Then, whenever new nodes are created, keep a separate list of those nodes, and visit them in order of creation before continuing with the regular worklist. For most combines, which visit a node, and replace it with one or more new nodes, all the new nodes should have operands which are either new nodes or nodes that have already been visited, and the results should only be used by new nodes and nodes which have not yet been visited. So you maintain sorted visitation order for the existing nodes. And creation order should be naturally sorted for the new nodes.
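
The two-list scheme in the previous paragraph might look roughly like the sketch below. It is only a toy model under stated assumptions: `TwoQueueWorklist`, `addNew` and `pop` are hypothetical names, nodes are integer ids, and the initial order is assumed to come from a topological sort done elsewhere. Newly created nodes go to a separate queue that is drained, in creation order, before the pre-sorted list resumes.

```cpp
#include <deque>
#include <vector>

// Toy worklist: a pre-sorted list of existing nodes plus a separate queue of
// newly created nodes, which is always drained first in creation order.
class TwoQueueWorklist {
  std::deque<int> Sorted; // existing nodes, in topological order
  std::deque<int> Fresh;  // nodes created while combining

public:
  explicit TwoQueueWorklist(const std::vector<int> &TopoOrder)
      : Sorted(TopoOrder.begin(), TopoOrder.end()) {}

  // Record a node created by a combine.
  void addNew(int Node) { Fresh.push_back(Node); }

  bool empty() const { return Sorted.empty() && Fresh.empty(); }

  // Next node to visit; new nodes take priority. Requires !empty().
  int pop() {
    std::deque<int> &Q = Fresh.empty() ? Sorted : Fresh;
    int Node = Q.front();
    Q.pop_front();
    return Node;
  }
};
```

This sketch leaves out the harder part raised above, namely combines that modify existing nodes rather than replacing a node with new ones.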

So I came up with a plan to get things processed in topological order. However, as mentioned previously, this is indeed leading into a world of pain. I have a prototype, but it causes numerous regressions that I want to investigate.

The general outline is as follows:
1/ Remove as much as possible of the manual management of the worklist. It is not realistic to expect every combine for every opcode, plus target-specific combines, to be able to operate the worklist consistently, so utility methods to help them do so have to be added. Fortunately, this management often does nothing useful at all and can simply be removed.
2/ Modify the utility methods that are expected to actually manipulate the worklist to mostly preserve topological order. It turns out it is not too difficult to have them do a good enough job.
3/ Do a topological sort at the beginning of the process, before adding all the nodes to the worklist.
4/ Modify the logic that adds the arguments of a node to the worklist in case they were not present already, to do lazy fixup when required to keep the processing topological. Because of 2 and 3, this ends up being a lightweight process.

As it turns out, many transforms done by DAGCombiner are dependent on the processing order. Because this order is not specifically defined, such transforms are unreliable in practice, and indeed many do break when the processing order is modified. This is another reason why 2 and 3 are useful, as they allow moving gradually instead of producing one giant diff of death. The opposite also happens, where patterns that were not picked up before are now being exploited.

I already started to execute step 1 of this plan. If people are happy with taking that path, I'd be happy to push that effort forward. I could do with some help investigating the regressions, as I'm not an expert in every single backend.

That sounds good to me personally.
I found it surprising that DAGCombine processes things in a different order compared to InstCombine.
Thank you for looking into it.

Adding @davezarzycki who was working on D70079 recently

Please rebase too

arsenm resigned from this revision. Feb 13 2020, 2:53 PM

greened mentioned this in D116832: [UpdateLLCTestChecks] Allow replacing register names with variables. Jan 12 2022, 10:48 AM

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp	131 lines
test/CodeGen/X86/addcarry.ll	41 lines
test/CodeGen/X86/subcarry.ll	34 lines

Diff 210981

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[146 lines hidden]	class DAGCombiner {
SmallSetVector<SDNode *, 32> PruningList;		SmallSetVector<SDNode *, 32> PruningList;

/// Set of nodes which have been combined (at least once).		/// Set of nodes which have been combined (at least once).
///		///
/// This is used to allow us to reliably add any operands of a DAG node		/// This is used to allow us to reliably add any operands of a DAG node
/// which have not yet been combined to the worklist.		/// which have not yet been combined to the worklist.
SmallPtrSet<SDNode *, 32> CombinedNodes;		SmallPtrSet<SDNode *, 32> CombinedNodes;

		/// Set of nodes which have tried to match a deep pattern.
		///
		/// This is used to ensure the nodes gets processed again in case the
		/// DAG is modified.
		SmallPtrSet<SDNode *, 16> DeepPatternNodes;

// AA - Used for DAG load/store alias analysis.		// AA - Used for DAG load/store alias analysis.
AliasAnalysis *AA;		AliasAnalysis *AA;

/// When an instruction is simplified, add all users of the instruction to		/// When an instruction is simplified, add all users of the instruction to
/// the work lists because they might get more simplified now.		/// the work lists because they might get more simplified now.
void AddUsersToWorklist(SDNode *N) {		void AddUsersToWorklist(SDNode *N) {
for (SDNode *Node : N->uses())		for (SDNode *Node : N->uses())
AddToWorklist(Node);		AddToWorklist(Node);
[68 lines hidden]	void AddToWorklist(SDNode *N) {
if (WorklistMap.insert(std::make_pair(N, Worklist.size())).second)		if (WorklistMap.insert(std::make_pair(N, Worklist.size())).second)
Worklist.push_back(N);		Worklist.push_back(N);
}		}

/// Remove all instances of N from the worklist.		/// Remove all instances of N from the worklist.
void removeFromWorklist(SDNode *N) {		void removeFromWorklist(SDNode *N) {
CombinedNodes.erase(N);		CombinedNodes.erase(N);
PruningList.remove(N);		PruningList.remove(N);
		DeepPatternNodes.erase(N);

auto It = WorklistMap.find(N);		auto It = WorklistMap.find(N);
if (It == WorklistMap.end())		if (It == WorklistMap.end())
return; // Not in the worklist.		return; // Not in the worklist.

// Null out the entry rather than erasing it to avoid a linear operation.		// Null out the entry rather than erasing it to avoid a linear operation.
Worklist[It->second] = nullptr;		Worklist[It->second] = nullptr;
WorklistMap.erase(It);		WorklistMap.erase(It);
[955 lines hidden]	if (TLO.Old.getNode()->use_empty())
deleteAndRecombine(TLO.Old.getNode());		deleteAndRecombine(TLO.Old.getNode());
}		}

/// Check the specified integer node value to see if it can be simplified or if		/// Check the specified integer node value to see if it can be simplified or if
/// things it uses can be simplified by bit propagation. If so, return true.		/// things it uses can be simplified by bit propagation. If so, return true.
bool DAGCombiner::SimplifyDemandedBits(SDValue Op, const APInt &DemandedBits,		bool DAGCombiner::SimplifyDemandedBits(SDValue Op, const APInt &DemandedBits,
const APInt &DemandedElts) {		const APInt &DemandedElts) {
TargetLowering::TargetLoweringOpt TLO(DAG, LegalTypes, LegalOperations);		TargetLowering::TargetLoweringOpt TLO(DAG, LegalTypes, LegalOperations);
KnownBits Known;		KnownBits Known;
if (!TLI.SimplifyDemandedBits(Op, DemandedBits, DemandedElts, Known, TLO))		if (!TLI.SimplifyDemandedBits(Op, DemandedBits, DemandedElts, Known, TLO))
return false;		return false;

// Revisit the node.		// Revisit the node.
AddToWorklist(Op.getNode());		AddToWorklist(Op.getNode());

// Replace the old value with the new one.		// Replace the old value with the new one.
++NodesCombined;		++NodesCombined;
[350 lines hidden]	void DAGCombiner::Run(CombineLevel AtLevel) {
for (SDNode &Node : DAG.allnodes())		for (SDNode &Node : DAG.allnodes())
AddToWorklist(&Node);		AddToWorklist(&Node);

// Create a dummy node (which is not added to allnodes), that adds a reference		// Create a dummy node (which is not added to allnodes), that adds a reference
// to the root node, preventing it from being deleted, and tracking any		// to the root node, preventing it from being deleted, and tracking any
// changes of the root.		// changes of the root.
HandleSDNode Dummy(DAG.getRoot());		HandleSDNode Dummy(DAG.getRoot());

		for (unsigned Iteration = 0; Iteration < 3; Iteration++) {
		bool Changed = false;

// While we have a valid worklist entry node, try to combine it.		// While we have a valid worklist entry node, try to combine it.
while (SDNode *N = getNextWorklistEntry()) {		while (SDNode *N = getNextWorklistEntry()) {
// If N has no uses, it is dead. Make sure to revisit all N's operands once		// If N has no uses, it is dead. Make sure to revisit all N's operands once
// N is deleted from the DAG, since they too may now be dead or may have a		// N is deleted from the DAG, since they too may now be dead or may have a
// reduced number of uses, allowing other xforms.		// reduced number of uses, allowing other xforms.
if (recursivelyDeleteUnusedNodes(N))		if (recursivelyDeleteUnusedNodes(N))
continue;		continue;

WorklistRemover DeadNodes(*this);		WorklistRemover DeadNodes(*this);

// If this combine is running after legalizing the DAG, re-legalize any		// If this combine is running after legalizing the DAG, re-legalize any
// nodes pulled off the worklist.		// nodes pulled off the worklist.
if (Level == AfterLegalizeDAG) {		if (Level == AfterLegalizeDAG) {
SmallSetVector<SDNode *, 16> UpdatedNodes;		SmallSetVector<SDNode *, 16> UpdatedNodes;
bool NIsValid = DAG.LegalizeOp(N, UpdatedNodes);		bool NIsValid = DAG.LegalizeOp(N, UpdatedNodes);

for (SDNode *LN : UpdatedNodes) {		for (SDNode *LN : UpdatedNodes) {
AddToWorklist(LN);		AddToWorklist(LN);
AddUsersToWorklist(LN);		AddUsersToWorklist(LN);
}		}
if (!NIsValid)		if (!NIsValid)
continue;		continue;
}		}

LLVM_DEBUG(dbgs() << "\nCombining: "; N->dump(&DAG));		LLVM_DEBUG(dbgs() << "\nCombining: "; N->dump(&DAG));

// Add any operands of the new node which have not yet been combined to the		// Add any operands of the new node which have not yet been combined to
// worklist as well. Because the worklist uniques things already, this		// the worklist as well. Because the worklist uniques things already,
// won't repeatedly process the same operand.		// this won't repeatedly process the same operand.
CombinedNodes.insert(N);		CombinedNodes.insert(N);
for (const SDValue &ChildN : N->op_values())		for (const SDValue &ChildN : N->op_values())
if (!CombinedNodes.count(ChildN.getNode()))		if (!CombinedNodes.count(ChildN.getNode()))
AddToWorklist(ChildN.getNode());		AddToWorklist(ChildN.getNode());

SDValue RV = combine(N);		SDValue RV = combine(N);

if (!RV.getNode())		if (!RV.getNode())
continue;		continue;

++NodesCombined;		++NodesCombined;
		Changed = true;

// If we get back the same node we passed in, rather than a new node or		// If we get back the same node we passed in, rather than a new node or
// zero, we know that the node must have defined multiple values and		// zero, we know that the node must have defined multiple values and
// CombineTo was used. Since CombineTo takes care of the worklist		// CombineTo was used. Since CombineTo takes care of the worklist
// mechanics for us, we have no work to do in this case.		// mechanics for us, we have no work to do in this case.
if (RV.getNode() == N)		if (RV.getNode() == N)
continue;		continue;

assert(N->getOpcode() != ISD::DELETED_NODE &&		assert(N->getOpcode() != ISD::DELETED_NODE &&
RV.getOpcode() != ISD::DELETED_NODE &&		RV.getOpcode() != ISD::DELETED_NODE &&
"Node was deleted but visit returned new node!");		"Node was deleted but visit returned new node!");

LLVM_DEBUG(dbgs() << " ... into: "; RV.getNode()->dump(&DAG));		LLVM_DEBUG(dbgs() << " ... into: "; RV.getNode()->dump(&DAG));

if (N->getNumValues() == RV.getNode()->getNumValues())		if (N->getNumValues() == RV.getNode()->getNumValues())
DAG.ReplaceAllUsesWith(N, RV.getNode());		DAG.ReplaceAllUsesWith(N, RV.getNode());
else {		else {
assert(N->getValueType(0) == RV.getValueType() &&		assert(N->getValueType(0) == RV.getValueType() &&
N->getNumValues() == 1 && "Type mismatch");		N->getNumValues() == 1 && "Type mismatch");
DAG.ReplaceAllUsesWith(N, &RV);		DAG.ReplaceAllUsesWith(N, &RV);
}		}

// Push the new node and any users onto the worklist		// Push the new node and any users onto the worklist
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
AddUsersToWorklist(RV.getNode());		AddUsersToWorklist(RV.getNode());

// Finally, if the node is now dead, remove it from the graph. The node		// Finally, if the node is now dead, remove it from the graph. The node
// may not be dead if the replacement process recursively simplified to		// may not be dead if the replacement process recursively simplified to
// something else needing this node. This will also take care of adding any		// something else needing this node. This will also take care of adding any
// operands which have lost a user to the worklist.		// operands which have lost a user to the worklist.
recursivelyDeleteUnusedNodes(N);		recursivelyDeleteUnusedNodes(N);
}		}

		if (!Changed)
		break;

		// Make sure we process the nodes for which a new combine may exist.
		for (SDNode *N : DeepPatternNodes)
		AddToWorklist(N);
		}

// If the root changed (e.g. it was a dead load, update the root).		// If the root changed (e.g. it was a dead load, update the root).
DAG.setRoot(Dummy.getValue());		DAG.setRoot(Dummy.getValue());
DAG.RemoveDeadNodes();		DAG.RemoveDeadNodes();
}		}

SDValue DAGCombiner::visit(SDNode *N) {		SDValue DAGCombiner::visit(SDNode *N) {
switch (N->getOpcode()) {		switch (N->getOpcode()) {
default: break;		default: break;
[1,324 lines hidden]	SDValue DAGCombiner::visitADDCARRYLike(SDValue N0, SDValue N1, SDValue CarryIn,
// or the dependency between the instructions.		// or the dependency between the instructions.
if ((N0.getOpcode() == ISD::ADD ||		if ((N0.getOpcode() == ISD::ADD ||
(N0.getOpcode() == ISD::UADDO && N0.getResNo() == 0 &&		(N0.getOpcode() == ISD::UADDO && N0.getResNo() == 0 &&
N0.getValue(1) != CarryIn)) &&		N0.getValue(1) != CarryIn)) &&
isNullConstant(N1) && !N->hasAnyUseOfValue(1))		isNullConstant(N1) && !N->hasAnyUseOfValue(1))
return DAG.getNode(ISD::ADDCARRY, SDLoc(N), N->getVTList(),		return DAG.getNode(ISD::ADDCARRY, SDLoc(N), N->getVTList(),
N0.getOperand(0), N0.getOperand(1), CarryIn);		N0.getOperand(0), N0.getOperand(1), CarryIn);

		DeepPatternNodes.insert(N);

/**		/**
* When one of the addcarry argument is itself a carry, we may be facing		* When one of the addcarry argument is itself a carry, we may be facing
* a diamond carry propagation. In which case we try to transform the DAG		* a diamond carry propagation. In which case we try to transform the DAG
* to ensure linear carry propagation if that is possible.		* to ensure linear carry propagation if that is possible.
*/		*/
if (auto Y = getAsCarry(TLI, N1)) {		if (auto Y = getAsCarry(TLI, N1)) {
// Because both are carries, Y and Z can be swapped.		// Because both are carries, Y and Z can be swapped.
if (auto R = combineADDCARRYDiamond(*this, DAG, N0, Y, CarryIn, N))		if (auto R = combineADDCARRYDiamond(*this, DAG, N0, Y, CarryIn, N))
[17,841 lines hidden]

test/CodeGen/X86/addcarry.ll

[306 lines hidden]	entry:
ret i64 %6		ret i64 %6
}		}

%S = type { [4 x i64] }		%S = type { [4 x i64] }

define %S @readd(%S* nocapture readonly %this, %S %arg.b) {		define %S @readd(%S* nocapture readonly %this, %S %arg.b) {
; CHECK-LABEL: readd:		; CHECK-LABEL: readd:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
		; CHECK-NEXT: pushq %rbx
		; CHECK-NEXT: .cfi_def_cfa_offset 16
		; CHECK-NEXT: .cfi_offset %rbx, -16
; CHECK-NEXT: movq %rdi, %rax		; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: addq (%rsi), %rdx		; CHECK-NEXT: movq (%rsi), %r10
; CHECK-NEXT: movq 8(%rsi), %r11		; CHECK-NEXT: movq %rdx, %rdi
; CHECK-NEXT: adcq $0, %r11		; CHECK-NEXT: addq %r10, %rdi
; CHECK-NEXT: setb %r10b		; CHECK-NEXT: movq 8(%rsi), %rdi
; CHECK-NEXT: movzbl %r10b, %edi		; CHECK-NEXT: movq 16(%rsi), %r11
; CHECK-NEXT: addq %rcx, %r11		; CHECK-NEXT: movq %rcx, %rbx
; CHECK-NEXT: adcq 16(%rsi), %rdi		; CHECK-NEXT: adcq %rdi, %rbx
		; CHECK-NEXT: addq %r10, %rdx
		; CHECK-NEXT: adcq %rdi, %rcx
; CHECK-NEXT: setb %cl		; CHECK-NEXT: setb %cl
; CHECK-NEXT: movzbl %cl, %ecx		; CHECK-NEXT: movq %r8, %rdi
; CHECK-NEXT: addq %r8, %rdi		; CHECK-NEXT: adcq %r11, %rdi
; CHECK-NEXT: adcq 24(%rsi), %rcx		; CHECK-NEXT: addb $255, %cl
; CHECK-NEXT: addq %r9, %rcx		; CHECK-NEXT: adcq %r11, %r8
		; CHECK-NEXT: adcq 24(%rsi), %r9
; CHECK-NEXT: movq %rdx, (%rax)		; CHECK-NEXT: movq %rdx, (%rax)
; CHECK-NEXT: movq %r11, 8(%rax)		; CHECK-NEXT: movq %rbx, 8(%rax)
; CHECK-NEXT: movq %rdi, 16(%rax)		; CHECK-NEXT: movq %rdi, 16(%rax)
; CHECK-NEXT: movq %rcx, 24(%rax)		; CHECK-NEXT: movq %r9, 24(%rax)
		; CHECK-NEXT: popq %rbx
		; CHECK-NEXT: .cfi_def_cfa_offset 8
; CHECK-NEXT: retq		; CHECK-NEXT: retq
entry:		entry:
%0 = extractvalue %S %arg.b, 0		%0 = extractvalue %S %arg.b, 0
%.elt6 = extractvalue [4 x i64] %0, 1		%.elt6 = extractvalue [4 x i64] %0, 1
%.elt8 = extractvalue [4 x i64] %0, 2		%.elt8 = extractvalue [4 x i64] %0, 2
%.elt10 = extractvalue [4 x i64] %0, 3		%.elt10 = extractvalue [4 x i64] %0, 3
%.elt = extractvalue [4 x i64] %0, 0		%.elt = extractvalue [4 x i64] %0, 0
%1 = getelementptr inbounds %S, %S* %this, i64 0, i32 0, i64 0		%1 = getelementptr inbounds %S, %S* %this, i64 0, i32 0, i64 0
[47 lines hidden]	; CHECK-NEXT: retq
ret i128 %2		ret i128 %2
}		}

define i128 @addcarry_to_subcarry(i64 %a, i64 %b) {		define i128 @addcarry_to_subcarry(i64 %a, i64 %b) {
; CHECK-LABEL: addcarry_to_subcarry:		; CHECK-LABEL: addcarry_to_subcarry:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: movq %rdi, %rax		; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: cmpq %rsi, %rdi		; CHECK-NEXT: cmpq %rsi, %rdi
; CHECK-NEXT: notq %rsi		; CHECK-NEXT: sbbq %rsi, %rax
; CHECK-NEXT: setae %cl		; CHECK-NEXT: setae %cl
; CHECK-NEXT: addb $-1, %cl
; CHECK-NEXT: adcq $0, %rax
; CHECK-NEXT: setb %cl
; CHECK-NEXT: movzbl %cl, %edx		; CHECK-NEXT: movzbl %cl, %edx
; CHECK-NEXT: addq %rsi, %rax
; CHECK-NEXT: adcq $0, %rdx
; CHECK-NEXT: retq		; CHECK-NEXT: retq
%notb = xor i64 %b, -1		%notb = xor i64 %b, -1
%notb128 = zext i64 %notb to i128		%notb128 = zext i64 %notb to i128
%a128 = zext i64 %a to i128		%a128 = zext i64 %a to i128
%sum1 = add i128 %a128, 1		%sum1 = add i128 %a128, 1
%sub1 = add i128 %sum1, %notb128		%sub1 = add i128 %sum1, %notb128
%hi = lshr i128 %sub1, 64		%hi = lshr i128 %sub1, 64
%sum2 = add i128 %hi, %a128		%sum2 = add i128 %hi, %a128
%sub2 = add i128 %sum2, %notb128		%sub2 = add i128 %sum2, %notb128
ret i128 %sub2		ret i128 %sub2
}		}

test/CodeGen/X86/subcarry.ll

[85 lines hidden]	entry:
%31 = insertvalue %S undef, [4 x i64] %30, 0		%31 = insertvalue %S undef, [4 x i64] %30, 0
ret %S %31		ret %S %31
}		}

define %S @sub(%S* nocapture readonly %this, %S %arg.b) local_unnamed_addr {		define %S @sub(%S* nocapture readonly %this, %S %arg.b) local_unnamed_addr {
; CHECK-LABEL: sub:		; CHECK-LABEL: sub:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: movq %rdi, %rax		; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: movq (%rsi), %r10		; CHECK-NEXT: movq (%rsi), %rdi
; CHECK-NEXT: movq 8(%rsi), %rdi		; CHECK-NEXT: movq 8(%rsi), %r10
; CHECK-NEXT: subq %rdx, %r10		; CHECK-NEXT: subq %rdx, %rdi
; CHECK-NEXT: setae %dl		; CHECK-NEXT: sbbq %rcx, %r10
; CHECK-NEXT: addb $-1, %dl		; CHECK-NEXT: movq 16(%rsi), %rcx
; CHECK-NEXT: adcq $0, %rdi		; CHECK-NEXT: sbbq %r8, %rcx
; CHECK-NEXT: setb %dl		; CHECK-NEXT: movq 24(%rsi), %rdx
; CHECK-NEXT: movzbl %dl, %r11d		; CHECK-NEXT: sbbq %r9, %rdx
; CHECK-NEXT: notq %rcx		; CHECK-NEXT: movq %rdi, (%rax)
; CHECK-NEXT: addq %rdi, %rcx		; CHECK-NEXT: movq %r10, 8(%rax)
; CHECK-NEXT: adcq 16(%rsi), %r11		; CHECK-NEXT: movq %rcx, 16(%rax)
; CHECK-NEXT: setb %dl		; CHECK-NEXT: movq %rdx, 24(%rax)
; CHECK-NEXT: movzbl %dl, %edx
; CHECK-NEXT: notq %r8
; CHECK-NEXT: addq %r11, %r8
; CHECK-NEXT: adcq 24(%rsi), %rdx
; CHECK-NEXT: notq %r9
; CHECK-NEXT: addq %rdx, %r9
; CHECK-NEXT: movq %r10, (%rax)
; CHECK-NEXT: movq %rcx, 8(%rax)
; CHECK-NEXT: movq %r8, 16(%rax)
; CHECK-NEXT: movq %r9, 24(%rax)
; CHECK-NEXT: retq		; CHECK-NEXT: retq
entry:		entry:
%0 = extractvalue %S %arg.b, 0		%0 = extractvalue %S %arg.b, 0
%.elt6 = extractvalue [4 x i64] %0, 1		%.elt6 = extractvalue [4 x i64] %0, 1
%.elt8 = extractvalue [4 x i64] %0, 2		%.elt8 = extractvalue [4 x i64] %0, 2
%.elt10 = extractvalue [4 x i64] %0, 3		%.elt10 = extractvalue [4 x i64] %0, 3
%.elt = extractvalue [4 x i64] %0, 0		%.elt = extractvalue [4 x i64] %0, 0
%1 = getelementptr inbounds %S, %S* %this, i64 0, i32 0, i64 0		%1 = getelementptr inbounds %S, %S* %this, i64 0, i32 0, i64 0
[41 lines hidden]