This is an archive of the discontinued LLVM Phabricator instance.


Differential D146988

[CodeGen] Enable processing of interconnected complex number operations
ClosedPublic

Authored by igor.kirillov on Mar 27 2023, 10:18 AM.


Details

Reviewers

NickGuy
huntergr
mgabka

Commits

rGc692e87ab8e7: [CodeGen] Enable processing of interconnected complex number operations

Summary

With this patch, ComplexDeinterleavingPass now has the ability to handle
any number of interconnected operations involving complex numbers.
For example, the patch enables the processing of code like the following:

for (int i = 0; i < 1000; ++i) {
  a[i] = w[i] * v[i];
  b[i] = w[i] * u[i];
}

This code has multiple arrays containing complex numbers and a common
subexpression w that appears in two expressions.
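
For concreteness, the loop above could be written in C++ roughly as follows. This is only a sketch: the element type `std::complex<float>`, the alias `cfloat`, the function name `mulShared`, and the use of `std::vector` are assumptions for illustration, since the summary does not say how the arrays are declared.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical helper: two complex products that share the operand w[i].
// All five vectors are assumed to have the same length.
void mulShared(const std::vector<cfloat> &w, const std::vector<cfloat> &v,
               const std::vector<cfloat> &u, std::vector<cfloat> &a,
               std::vector<cfloat> &b) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    a[i] = w[i] * v[i]; // first use of w[i]
    b[i] = w[i] * u[i]; // second use of the same w[i]
  }
}
```

Once vectorized, both products read the same deinterleaved real and imaginary lanes of `w`, which is the multiple-use situation the patch is meant to handle.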

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

igor.kirillov created this revision. Mar 27 2023, 10:18 AM

Herald added a project: Restricted Project. Mar 27 2023, 10:18 AM

Herald added a subscriber: hiraditya.

igor.kirillov requested review of this revision. Mar 27 2023, 10:18 AM

Herald added a subscriber: llvm-commits.

Remove old comment

igor.kirillov added reviewers: NickGuy, huntergr. Mar 27 2023, 10:47 AM

Harbormaster completed remote builds in B222069: Diff 508727. Mar 27 2023, 1:06 PM

NickGuy added inline comments. Mar 28 2023, 3:29 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
143	What's the rationale behind removing this? I might be missing something, but it looks like you're removing it then looking for the internal instructions on-demand in `checkNodes` later
908–913	How is this expected to behave with operands that leave the chain (i.e. Phi nodes or instructions in a different basic block)? Is there a risk of `AllInstructions` containing most of the function in some cases?

I am planning to add several more patches after this one, including:

Support for scalable vectors
Reductions
Full -Ofast mode

With regard to -Ofast, the current code is suitable for simple cases where only one multiplication or addition is involved. However, when the -Ofast flag is set, the compiler may rearrange the order of instructions, so that the computations of the real and imaginary parts no longer run in parallel. For example, a loop like the following may not be processed:

for (int i = 0; i < N; ++i)
  u[i] = v[i] * w[i] + u[i] * y[i];
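
To make the reordering concern concrete, here is a hedged sketch of that loop body with the complex arithmetic expanded by hand into real and imaginary parts. The struct `Cplx` and the function `mulAddExpanded` are hypothetical names, and the grouping of terms is just one ordering that a fast-math compiler would be free to choose, not output taken from LLVM. Each part is a sum of four products, so reassociation can group them in ways where the real and imaginary expressions no longer mirror each other step by step.

```cpp
// Hypothetical plain struct standing in for one complex element.
struct Cplx {
  float re;
  float im;
};

// Computes v * w + u * y with the products written out explicitly.
Cplx mulAddExpanded(Cplx u, Cplx v, Cplx w, Cplx y) {
  // Real part: the products are still paired per multiplication.
  float re = (v.re * w.re - v.im * w.im) + (u.re * y.re - u.im * y.im);
  // Imaginary part: mathematically the same sum as usual, but the additions
  // are grouped differently, as reassociation under fast-math permits.
  float im = ((v.re * w.im + u.re * y.im) + v.im * w.re) + u.im * y.re;
  return {re, im};
}
```

In the real part the two multiplications are easy to tell apart; in the imaginary part their terms are interleaved, which is the kind of shape a matcher expecting one multiplication at a time would not recognise.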

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
143	Ah, yes I should explain. I'll be adding support for -Ofast (see top comment), which might cause some issues. Basically, the problem is that the `ComplexDeinterleavingNode` can't be attached to `ComplexDeinterleavingNode::Imag` and `ComplexDeinterleavingNode::Real` instructions, which means that the `submitCompositeNode` function won't be able to fill `InternalInstructions`. To solve this, I've decided to split the detection and checking process into two stages. I think it is more straightforward.
908–913	PHINodes will also act as leaves and be in FinalInstructions, and instructions from a different block are not problematic as long as we ensure that there are no other uses, which we do. If a loop has a large number of instructions, it will still be processed in the same manner, and I don't see any problems. As FCMLA/FCADD are vector instructions, our focus is mainly on vector loops. Therefore, I think it is more beneficial to perform this check at the BasicBlock level rather than the Function level. Alternatively, we could apply this pass per Loop, but this would result in losing support for some Neon cases that can be generated by the autovectorizer outside of a loop.

igor.kirillov added a child revision: D147451: [CodeGen] Enable AArch64 SVE FCMLA/FCADD instruction generation in ComplexDeinterleaving. Apr 3 2023, 9:41 AM

Matt added a subscriber: Matt. Apr 3 2023, 1:56 PM

mgabka added a subscriber: mgabka. Apr 4 2023, 1:58 AM

igor.kirillov added a reviewer: mgabka. Apr 4 2023, 6:42 AM

LGTM

In D146988#4227451, @igor.kirillov wrote:

As FCMLA/FCADD are vector instructions, our focus is mainly on vector loops. Therefore, I think it is more beneficial to perform this check at the BasicBlock level rather than the Function level. Alternatively, we could apply this pass per Loop, but this would result in losing support for some Neon cases that can be generated by the autovectorizer outside of a loop.

That makes sense in this case, but keep in mind that scalar loops (or in the case of neon & complex doubles, even non-looping operations) can be lowered to use xcmla/xcadd, so I'm not sure we should rely on the concept of a loop for this.
The scalar loops and co. are something that we can look into in the future though; trying to do so now will only complicate things and bloat the patch.

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
3	Nit: Probably don't need `+fullfp16` in this test, as `half` isn't being used.

This revision is now accepted and ready to land. Apr 4 2023, 7:12 AM

Hello. Nick pointed me at the patch but I haven't looked at the details a huge amount. checkNodes feels quite complex, but handling multiple users seems like a nice addition to the pass. The pass should be able to handle multiple basic blocks just fine, but limiting multiple uses to all start in the same block sounds like a good compromise between complexity and functionality.

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
895	Worklist is a commonly used name in LLVM
924	A user of an instruction will always be an instruction.
929	This seems like it adds the user `OpI` to `OuterInstructions`, just so that later it can look at the operands and add `I` to `ToDo`. Can it just add `I` directly?
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	It is not obvious to me why d * c should prevent the rest of the tree transforming. Can you precommit the tests to show the differences in the review?

NickGuy added inline comments. Apr 5 2023, 8:14 AM

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	I'm not sure it's the `d * c` that's preventing it, so much as the store to %p1 (It doesn't reinterleave, so the store is deemed "outside the graph"). Though I agree, pre-commits would be useful to see the difference in the other cases

igor.kirillov added inline comments. Apr 5 2023, 1:57 PM

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	The idea is that any external use of a complex computation chain prevents deinterleaving. And if one chain can no longer be deinterleaved, that might affect another chain, which is shown in this test: p1 prevents a * b from being deinterleaved, which in turn prevents (a * b * c) from being deinterleaved, and that in turn spoils the d * c expression. As a result, nothing is deinterleaved. At least that's what I thought before pre-committing these tests, where the pre-patch code decided to deinterleave the d * c operation - https://reviews.llvm.org/D147659. Now, I have a question. If shufflevector is used outside, should we still deinterleave code?
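
The propagation described in this comment can be modelled on a toy graph. The sketch below is a simplified illustration, not the pass's code: `ToyInst`, `survivingRoots`, and the integer instruction ids are hypothetical, and it seeds the second traversal directly from the instruction that has the outside user. It reflects the behaviour described at this point in the review (any outside use taints every connected chain), not necessarily what was finally committed.

```cpp
#include <set>
#include <vector>

// Toy stand-in for an IR instruction: only operand edges are modelled.
struct ToyInst {
  std::vector<int> operands; // ids of the instructions this one reads
};

// Returns the roots that survive the external-use check.
std::set<int> survivingRoots(const std::vector<ToyInst> &insts,
                             const std::vector<int> &roots,
                             const std::set<int> &leaves) {
  // Derive user edges from operand edges.
  std::vector<std::vector<int>> users(insts.size());
  for (int i = 0; i < static_cast<int>(insts.size()); ++i)
    for (int op : insts[i].operands)
      users[op].push_back(i);

  // Phase 1: collect everything reachable from the roots, stopping at leaves.
  std::set<int> all;
  std::vector<int> worklist(roots.begin(), roots.end());
  while (!worklist.empty()) {
    int i = worklist.back();
    worklist.pop_back();
    if (!all.insert(i).second || leaves.count(i))
      continue;
    for (int op : insts[i].operands)
      worklist.push_back(op);
  }

  // Phase 2: a non-root instruction with a user outside the collected set
  // taints its chain; seed the worklist with such instructions.
  std::set<int> alive(roots.begin(), roots.end());
  for (int i : all) {
    if (alive.count(i))
      continue;
    for (int u : users[i])
      if (!all.count(u))
        worklist.push_back(i);
  }

  // Phase 3: spread the taint through users and operands, dropping any root
  // that is reached. Leaves do not pass the taint on to their operands.
  std::set<int> visited;
  while (!worklist.empty()) {
    int i = worklist.back();
    worklist.pop_back();
    if (!visited.insert(i).second)
      continue;
    alive.erase(i);
    if (!all.count(i))
      continue;
    for (int u : users[i])
      worklist.push_back(u);
    if (leaves.count(i))
      continue;
    for (int op : insts[i].operands)
      worklist.push_back(op);
  }
  return alive;
}
```

With two chains that share nothing, an outside user in one chain removes only that chain's root. As soon as the chains share a leaf, the taint crosses over and both roots are dropped, which is the "nothing is deinterleaved" outcome described above.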

NickGuy requested changes to this revision. Apr 6 2023, 3:51 AM

NickGuy added inline comments.

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	"If shufflevector is used outside, should we still deinterleave code?"
As long as the shufflevector used is the reinterleaving one we're fine to do so, as it should simply replace only what it can and preserve the rest, but that feels very much like a cost model problem to me.
"The idea is that any external use of a complex computation chain prevents deinterleaving. And if one chain can no longer be deinterleaved, that might affect another chain, which is shown in this test: p1 prevents a * b from being deinterleaved, which in turn prevents (a * b * c) from being deinterleaved, and that in turn spoils the d * c expression. As a result, nothing is deinterleaved."
In this case though, I don't see a reason why `d * c` shouldn't be replaced; it's not part of any other expression, so it should be able to be replaced in isolation. For that matter, the assignment to p1 is the only expression that I'd expect not to be transformed; the rest should be able to go through fine (it would be a different story if it was something like `p2 = (a * b) * p1`, but the lack of data dependencies between them means they can and should be treated in isolation from each other).
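
The deinterleaving and reinterleaving shuffles discussed here can be told apart by their masks: in the test file, <0, 2> and <1, 3> pull the real and imaginary lanes apart, and <0, 2, 1, 3> zips them back together. A minimal standalone sketch of such mask checks follows. The function names and the use of `std::vector<int>` are assumptions for illustration; they are not the pass's actual helpers.

```cpp
#include <cstddef>
#include <vector>

// True if Mask selects every second element starting at Offset
// (0 = real lanes, 1 = imaginary lanes), e.g. <0, 2> or <1, 3>.
bool isDeinterleavingMaskSketch(const std::vector<int> &Mask, int Offset) {
  for (std::size_t Idx = 0; Idx < Mask.size(); ++Idx)
    if (Mask[Idx] != static_cast<int>(Idx) * 2 + Offset)
      return false;
  return !Mask.empty();
}

// True if Mask zips two equally sized halves back together,
// e.g. <0, 2, 1, 3> for two 2-element inputs.
bool isInterleavingMaskSketch(const std::vector<int> &Mask) {
  if (Mask.empty() || Mask.size() % 2 != 0)
    return false;
  int Half = static_cast<int>(Mask.size() / 2);
  for (int Idx = 0; Idx < Half; ++Idx)
    if (Mask[2 * Idx] != Idx || Mask[2 * Idx + 1] != Idx + Half)
      return false;
  return true;
}
```

Under this model, a store fed by a shuffle that passes the interleaving check consumes the recombined complex vector, whereas a use of the separate real or imaginary lanes is the kind of outside use the discussion is concerned with.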

This revision now requires changes to proceed. Apr 6 2023, 3:51 AM

dmgreen added inline comments. Apr 6 2023, 4:57 AM

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	Sorry, I apparently forgot to send this as a reply to @igor.kirillov. Yeah, that was my point, but it will depend on whether the shuffles get folded into a ld2 or need to be replicated. On its own, replacing fadd/fmul(shuffle, shuffle) with fcmla will be profitable even if the shuffles have other uses. But if the shuffles could be combined into a ld2 it isn't as obvious. One of the earlier versions of this pass had a cost model that tried to work out when it was profitable, but I suggested removing it to simplify the patch, as it wasn't very useful at the time. I would guess that adding multiple disparate subgraphs into the same graph makes that more difficult to do consistently, but in this case we can perhaps just check the shuffle's operand to see if it is a load. It is probably worth having tests for both, and giving them better names.

igor.kirillov added inline comments. Apr 6 2023, 7:55 AM

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
94	@dmgreen I am not sure I understood correctly which tests to add. But anyway, I readjusted this one and added a new one that shows how external use might infect the operation graph - https://reviews.llvm.org/D147659

igor.kirillov mentioned this in rGb983b44d36dd: [CodeGen] Add Pre-commit tests for D146988. Apr 11 2023, 4:48 AM

If final nodes have external users, it will not prevent complex deinterleaving. Refactoring

igor.kirillov marked 10 inline comments as done. Apr 11 2023, 6:26 AM

Harbormaster completed remote builds in B224759: Diff 512423. Apr 11 2023, 7:14 AM

NickGuy added inline comments. Apr 11 2023, 8:57 AM

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
924	Nit: If we know for a fact that it's an Instruction, it might be better to use `cast` rather than `dyn_cast`
956	As above, `cast` instead of `dyn_cast`
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
4–5	Why is it not? The presence of `fcmla` in the expected output tells me that it is expected to transform

Update comment and change dyn_cast to cast where it was required

igor.kirillov marked 3 inline comments as done. Apr 11 2023, 9:11 AM

igor.kirillov added inline comments.

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll
4–5	Yes! Forgot to update after rebasing on top of the pre-commit.

Harbormaster completed remote builds in B224800: Diff 512486. Apr 11 2023, 9:59 AM

Ping?

Thanks for the ping, the recent changes passed me by completely.

This looks good to me (with the caveat that cost modeling is missing at the moment), though I might suggest waiting a couple more days before merging, in case someone notices anything else I missed :)

This revision is now accepted and ready to land. Apr 14 2023, 5:49 AM

Closed by commit rGc692e87ab8e7: [CodeGen] Enable processing of interconnected complex number operations (authored by igor.kirillov). Apr 18 2023, 6:06 AM

This revision was automatically updated to reflect the committed changes.

igor.kirillov added a commit: rGc692e87ab8e7: [CodeGen] Enable processing of interconnected complex number operations.

Revision Contents

Path	Size
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp	199 lines
llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll	161 lines

Diff 508726

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp

[134 lines not shown]
public:		public:
ComplexDeinterleavingOperation Operation;		ComplexDeinterleavingOperation Operation;
Instruction *Real;		Instruction *Real;
Instruction *Imag;		Instruction *Imag;

// Instructions that should only exist within this node, there should be no		// Instructions that should only exist within this node, there should be no
// users of these instructions outside the node. An example of these would be		// users of these instructions outside the node. An example of these would be
// the multiply instructions of a partial multiply operation.		// the multiply instructions of a partial multiply operation.
SmallVector<Instruction *> InternalInstructions;
ComplexDeinterleavingRotation Rotation;		ComplexDeinterleavingRotation Rotation;
SmallVector<RawNodePtr> Operands;		SmallVector<RawNodePtr> Operands;
Value *ReplacementNode = nullptr;		Value *ReplacementNode = nullptr;

void addInstruction(Instruction *I) { InternalInstructions.push_back(I); }
void addOperand(NodePtr Node) { Operands.push_back(Node.get()); }		void addOperand(NodePtr Node) { Operands.push_back(Node.get()); }

bool hasAllInternalUses(SmallPtrSet<Instruction *, 16> &AllInstructions);

void dump() { dump(dbgs()); }		void dump() { dump(dbgs()); }
void dump(raw_ostream &OS) {		void dump(raw_ostream &OS) {
auto PrintValue = [&](Value *V) {		auto PrintValue = [&](Value *V) {
if (V) {		if (V) {
OS << "\"";		OS << "\"";
V->print(OS, true);		V->print(OS, true);
OS << "\"\n";		OS << "\"\n";
} else		} else
[15 lines not shown]	void dump(raw_ostream &OS) {
PrintValue(ReplacementNode);		PrintValue(ReplacementNode);
OS << " Operation: " << (int)Operation << "\n";		OS << " Operation: " << (int)Operation << "\n";
OS << " Rotation: " << ((int)Rotation * 90) << "\n";		OS << " Rotation: " << ((int)Rotation * 90) << "\n";
OS << " Operands: \n";		OS << " Operands: \n";
for (const auto &Op : Operands) {		for (const auto &Op : Operands) {
OS << " - ";		OS << " - ";
PrintNodeRef(Op);		PrintNodeRef(Op);
}		}
OS << " InternalInstructions:\n";
for (const auto &I : InternalInstructions) {
OS << " - \"";
I->print(OS, true);
OS << "\"\n";
}
}		}
};		};

class ComplexDeinterleavingGraph {		class ComplexDeinterleavingGraph {
public:		public:
using NodePtr = ComplexDeinterleavingCompositeNode::NodePtr;		using NodePtr = ComplexDeinterleavingCompositeNode::NodePtr;
using RawNodePtr = ComplexDeinterleavingCompositeNode::RawNodePtr;		using RawNodePtr = ComplexDeinterleavingCompositeNode::RawNodePtr;
explicit ComplexDeinterleavingGraph(const TargetLowering *tl) : TL(tl) {}		explicit ComplexDeinterleavingGraph(const TargetLowering *TL,
		const TargetLibraryInfo *TLI)
		: TL(TL), TLI(TLI) {}

private:		private:
const TargetLowering *TL;		const TargetLowering *TL;
Instruction *RootValue;		const TargetLibraryInfo *TLI;
NodePtr RootNode;
SmallVector<NodePtr> CompositeNodes;		SmallVector<NodePtr> CompositeNodes;
SmallPtrSet<Instruction *, 16> AllInstructions;
		SmallPtrSet<Instruction *, 16> FinalInstructions;

		/// Root instructions are instructions from which complex computation starts
		std::map<Instruction *, NodePtr> RootToNode;

		/// Topologically sorted root instructions
		SmallVector<Instruction *, 1> OrderedRoots;

NodePtr prepareCompositeNode(ComplexDeinterleavingOperation Operation,		NodePtr prepareCompositeNode(ComplexDeinterleavingOperation Operation,
Instruction *R, Instruction *I) {		Instruction *R, Instruction *I) {
return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation, R,		return std::make_shared<ComplexDeinterleavingCompositeNode>(Operation, R,
I);		I);
}		}

NodePtr submitCompositeNode(NodePtr Node) {		NodePtr submitCompositeNode(NodePtr Node) {
CompositeNodes.push_back(Node);		CompositeNodes.push_back(Node);
AllInstructions.insert(Node->Real);
AllInstructions.insert(Node->Imag);
for (auto *I : Node->InternalInstructions)
AllInstructions.insert(I);
return Node;		return Node;
}		}

NodePtr getContainingComposite(Value *R, Value *I) {		NodePtr getContainingComposite(Value *R, Value *I) {
for (const auto &CN : CompositeNodes) {		for (const auto &CN : CompositeNodes) {
if (CN->Real == R && CN->Imag == I)		if (CN->Real == R && CN->Imag == I)
return CN;		return CN;
}		}
[40 lines not shown]	void dump(raw_ostream &OS) {
for (const auto &Node : CompositeNodes)		for (const auto &Node : CompositeNodes)
Node->dump(OS);		Node->dump(OS);
}		}

/// Returns false if the deinterleaving operation should be cancelled for the		/// Returns false if the deinterleaving operation should be cancelled for the
/// current graph.		/// current graph.
bool identifyNodes(Instruction *RootI);		bool identifyNodes(Instruction *RootI);

		/// Check that every instruction, from the roots to the leaves, has internal
		/// uses.
		bool checkNodes();

/// Perform the actual replacement of the underlying instruction graph.		/// Perform the actual replacement of the underlying instruction graph.
void replaceNodes();		void replaceNodes();
};		};

class ComplexDeinterleaving {		class ComplexDeinterleaving {
public:		public:
ComplexDeinterleaving(const TargetLowering *tl, const TargetLibraryInfo *tli)		ComplexDeinterleaving(const TargetLowering *tl, const TargetLibraryInfo *tli)
: TL(tl), TLI(tli) {}		: TL(tl), TLI(tli) {}
[81 lines not shown]	for (int Idx = 1; Idx < HalfNumElements; ++Idx) {
if (Mask[Idx] != (Idx * 2) + Offset)		if (Mask[Idx] != (Idx * 2) + Offset)
return false;		return false;
}		}

return true;		return true;
}		}

bool ComplexDeinterleaving::evaluateBasicBlock(BasicBlock *B) {		bool ComplexDeinterleaving::evaluateBasicBlock(BasicBlock *B) {
bool Changed = false;		ComplexDeinterleavingGraph Graph(TL, TLI);

SmallVector<Instruction *> DeadInstrRoots;

for (auto &I : *B) {		for (auto &I : *B) {
auto *SVI = dyn_cast<ShuffleVectorInst>(&I);		auto *SVI = dyn_cast<ShuffleVectorInst>(&I);
if (!SVI)		if (!SVI)
continue;		continue;

// Look for a shufflevector that takes separate vectors of the real and		// Look for a shufflevector that takes separate vectors of the real and
// imaginary components and recombines them into a single vector.		// imaginary components and recombines them into a single vector.
if (!isInterleavingMask(SVI->getShuffleMask()))		if (!isInterleavingMask(SVI->getShuffleMask()))
continue;		continue;

ComplexDeinterleavingGraph Graph(TL);		Graph.identifyNodes(SVI);
if (!Graph.identifyNodes(SVI))
continue;

Graph.replaceNodes();
DeadInstrRoots.push_back(SVI);
Changed = true;
}		}

for (const auto &I : DeadInstrRoots) {		if (Graph.checkNodes()) {
if (!I || I->getParent() == nullptr)		Graph.replaceNodes();
continue;		return true;
llvm::RecursivelyDeleteTriviallyDeadInstructions(I, TLI);
}		}

return Changed;		return false;
}		}

ComplexDeinterleavingGraph::NodePtr		ComplexDeinterleavingGraph::NodePtr
ComplexDeinterleavingGraph::identifyNodeWithImplicitAdd(		ComplexDeinterleavingGraph::identifyNodeWithImplicitAdd(
Instruction *Real, Instruction *Imag,		Instruction *Real, Instruction *Imag,
std::pair<Instruction *, Instruction *> &PartialMatch) {		std::pair<Instruction *, Instruction *> &PartialMatch) {
LLVM_DEBUG(dbgs() << "identifyNodeWithImplicitAdd " << *Real << " / " << *Imag		LLVM_DEBUG(dbgs() << "identifyNodeWithImplicitAdd " << *Real << " / " << *Imag
<< "\n");		<< "\n");
[97 lines not shown]	if (!UncommonNode) {
return nullptr;		return nullptr;
}		}

NodePtr Node = prepareCompositeNode(		NodePtr Node = prepareCompositeNode(
ComplexDeinterleavingOperation::CMulPartial, Real, Imag);		ComplexDeinterleavingOperation::CMulPartial, Real, Imag);
Node->Rotation = Rotation;		Node->Rotation = Rotation;
Node->addOperand(CommonNode);		Node->addOperand(CommonNode);
Node->addOperand(UncommonNode);		Node->addOperand(UncommonNode);
Node->InternalInstructions.append(FNegs);
return submitCompositeNode(Node);		return submitCompositeNode(Node);
}		}

ComplexDeinterleavingGraph::NodePtr		ComplexDeinterleavingGraph::NodePtr
ComplexDeinterleavingGraph::identifyPartialMul(Instruction *Real,		ComplexDeinterleavingGraph::identifyPartialMul(Instruction *Real,
Instruction *Imag) {		Instruction *Imag) {
LLVM_DEBUG(dbgs() << "identifyPartialMul " << *Real << " / " << *Imag		LLVM_DEBUG(dbgs() << "identifyPartialMul " << *Real << " / " << *Imag
<< "\n");		<< "\n");
[99 lines not shown]	ComplexDeinterleavingGraph::identifyPartialMul(Instruction *Real,
NodePtr CommonRes = identifyNode(PartialMatch.first, PartialMatch.second);		NodePtr CommonRes = identifyNode(PartialMatch.first, PartialMatch.second);
if (!CommonRes) {		if (!CommonRes) {
LLVM_DEBUG(dbgs() << " - No CommonRes identified\n");		LLVM_DEBUG(dbgs() << " - No CommonRes identified\n");
return nullptr;		return nullptr;
}		}

NodePtr Node = prepareCompositeNode(		NodePtr Node = prepareCompositeNode(
ComplexDeinterleavingOperation::CMulPartial, Real, Imag);		ComplexDeinterleavingOperation::CMulPartial, Real, Imag);
Node->addInstruction(RealMulI);
Node->addInstruction(ImagMulI);
Node->Rotation = Rotation;		Node->Rotation = Rotation;
Node->addOperand(CommonRes);		Node->addOperand(CommonRes);
Node->addOperand(UncommonRes);		Node->addOperand(UncommonRes);
Node->addOperand(CNode);		Node->addOperand(CNode);
return submitCompositeNode(Node);		return submitCompositeNode(Node);
}		}

ComplexDeinterleavingGraph::NodePtr		ComplexDeinterleavingGraph::NodePtr
[201 lines not shown]	if (!CheckDeinterleavingShuffle(ImagShuffle)) {
LLVM_DEBUG(dbgs() << " - ImagShuffle is invalid type.\n");		LLVM_DEBUG(dbgs() << " - ImagShuffle is invalid type.\n");
return nullptr;		return nullptr;
}		}

NodePtr PlaceholderNode =		NodePtr PlaceholderNode =
prepareCompositeNode(llvm::ComplexDeinterleavingOperation::Shuffle,		prepareCompositeNode(llvm::ComplexDeinterleavingOperation::Shuffle,
RealShuffle, ImagShuffle);		RealShuffle, ImagShuffle);
PlaceholderNode->ReplacementNode = RealShuffle->getOperand(0);		PlaceholderNode->ReplacementNode = RealShuffle->getOperand(0);
		FinalInstructions.insert(RealShuffle);
		FinalInstructions.insert(ImagShuffle);
return submitCompositeNode(PlaceholderNode);		return submitCompositeNode(PlaceholderNode);
}		}
if (RealShuffle || ImagShuffle) {		if (RealShuffle || ImagShuffle) {
LLVM_DEBUG(dbgs() << " - There's a shuffle where there shouldn't be.\n");		LLVM_DEBUG(dbgs() << " - There's a shuffle where there shouldn't be.\n");
return nullptr;		return nullptr;
}		}

auto *VTy = cast<FixedVectorType>(Real->getType());		auto *VTy = cast<FixedVectorType>(Real->getType());
[19 lines not shown]
}		}

bool ComplexDeinterleavingGraph::identifyNodes(Instruction *RootI) {		bool ComplexDeinterleavingGraph::identifyNodes(Instruction *RootI) {
Instruction *Real;		Instruction *Real;
Instruction *Imag;		Instruction *Imag;
if (!match(RootI, m_Shuffle(m_Instruction(Real), m_Instruction(Imag))))		if (!match(RootI, m_Shuffle(m_Instruction(Real), m_Instruction(Imag))))
return false;		return false;

RootValue = RootI;		auto RootNode = identifyNode(Real, Imag);
AllInstructions.insert(RootI);
RootNode = identifyNode(Real, Imag);

LLVM_DEBUG({		LLVM_DEBUG({
Function *F = RootI->getFunction();		Function *F = RootI->getFunction();
BasicBlock *B = RootI->getParent();		BasicBlock *B = RootI->getParent();
dbgs() << "Complex deinterleaving graph for " << F->getName()		dbgs() << "Complex deinterleaving graph for " << F->getName()
<< "::" << B->getName() << ".\n";		<< "::" << B->getName() << ".\n";
dump(dbgs());		dump(dbgs());
dbgs() << "\n";		dbgs() << "\n";
});		});

// Check all instructions have internal uses		if (RootNode) {
for (const auto &Node : CompositeNodes) {		RootToNode[RootI] = RootNode;
if (!Node->hasAllInternalUses(AllInstructions)) {		OrderedRoots.push_back(RootI);
LLVM_DEBUG(dbgs() << " - Invalid internal uses\n");		return true;
		}

return false;		return false;
}		}

		bool ComplexDeinterleavingGraph::checkNodes() {
		// Collect all instructions from roots to leaves
		SmallPtrSet<Instruction *, 16> AllInstructions;
		SmallVector<Instruction *, 8> ToDo;
		for (auto *I : OrderedRoots)
		ToDo.push_back(I);

		// Extract all instructions that are used by all XCMLA/XCADD/ADD/SUB/NEG
		// chains
		while (!ToDo.empty()) {
		auto *I = ToDo.back();
		ToDo.pop_back();

		if (!AllInstructions.insert(I).second)
		continue;

		if (!FinalInstructions.count(I)) {
		for (Value *Op : I->operands()) {
		if (auto *OpI = dyn_cast<Instruction>(Op))
		ToDo.emplace_back(OpI);
		}
		}
		}

		// Find instructions that have users outside of chain
		SmallVector<Instruction *, 2> OuterInstructions;
		for (auto *I : AllInstructions) {
		// Skip root nodes
		if (RootToNode.count(I))
		continue;

		for (User *U : I->users()) {
		if (auto *OpI = dyn_cast<Instruction>(U)) {
		if (AllInstructions.count(OpI))
		continue;

		// Found an instruction that is not used by XCMLA/XCADD chain
		OuterInstructions.emplace_back(OpI);
		}
		}
		}

		// If any instructions are found to be used outside, find and remove roots
		// that somehow connect to those instructions.
		SmallPtrSet<Instruction *, 16> Visited;
		for (Instruction *I : OuterInstructions) {
		for (Value *Op : I->operands()) {
		if (auto *OpI = dyn_cast<Instruction>(Op))
		ToDo.emplace_back(OpI);
		}
		}

		while (!ToDo.empty()) {
		auto *I = ToDo.back();
		ToDo.pop_back();
		if (!Visited.insert(I).second)
		continue;

		// Found an impacted root node. Removing it from the nodes to be
		// deinterleaved
		if (RootToNode.count(I)) {
		LLVM_DEBUG(dbgs() << "Instruction " << *I
		<< " could be deinterleaved but its chain of complex "
		"operations have an outside user\n");
		RootToNode.erase(I);
		}

		if (!AllInstructions.count(I))
		continue;

		for (User *U : I->users()) {
		if (auto *OpI = dyn_cast<Instruction>(U))
		ToDo.emplace_back(OpI);
		}

		if (FinalInstructions.count(I))
		continue;

		for (Value *Op : I->operands()) {
		if (auto *OpI = dyn_cast<Instruction>(Op))
		ToDo.emplace_back(OpI);
		}
}		}
return RootNode != nullptr;		return !RootToNode.empty();
}		}

static Value *replaceSymmetricNode(ComplexDeinterleavingGraph::RawNodePtr Node,		static Value *replaceSymmetricNode(ComplexDeinterleavingGraph::RawNodePtr Node,
Value *InputA, Value *InputB) {		Value *InputA, Value *InputB) {
Instruction *I = Node->Real;		Instruction *I = Node->Real;
if (I->isUnaryOp())		if (I->isUnaryOp())
assert(!InputB &&		assert(!InputB &&
"Unary symmetric operations need one input, but two were provided.");		"Unary symmetric operations need one input, but two were provided.");
[40 lines not shown]	Node->ReplacementNode = TL->createComplexDeinterleavingIR(
Accumulator);		Accumulator);

assert(Node->ReplacementNode && "Target failed to create Intrinsic call.");		assert(Node->ReplacementNode && "Target failed to create Intrinsic call.");
NumComplexTransformations += 1;		NumComplexTransformations += 1;
return Node->ReplacementNode;		return Node->ReplacementNode;
}		}

void ComplexDeinterleavingGraph::replaceNodes() {		void ComplexDeinterleavingGraph::replaceNodes() {
		SmallVector<Instruction *, 16> DeadInstrRoots;
		for (auto *RootInstruction : OrderedRoots) {
		// Check if this potential root went through check process and we can
		// deinterleave it
		if (!RootToNode.count(RootInstruction))
		continue;

		IRBuilder<> Builder(RootInstruction);
		auto RootNode = RootToNode[RootInstruction];
Value *R = replaceNode(RootNode.get());		Value *R = replaceNode(RootNode.get());
assert(R && "Unable to find replacement for RootValue");		assert(R && "Unable to find replacement for RootInstruction");
RootValue->replaceAllUsesWith(R);		DeadInstrRoots.push_back(RootInstruction);
		RootInstruction->replaceAllUsesWith(R);
}		}

bool ComplexDeinterleavingCompositeNode::hasAllInternalUses(		for (auto *I : DeadInstrRoots)
SmallPtrSet<Instruction *, 16> &AllInstructions) {		RecursivelyDeleteTriviallyDeadInstructions(I, TLI);
if (Operation == ComplexDeinterleavingOperation::Shuffle)
return true;

for (auto *User : Real->users()) {
if (!AllInstructions.contains(cast<Instruction>(User)))
return false;
}
for (auto *User : Imag->users()) {
if (!AllInstructions.contains(cast<Instruction>(User)))
return false;
}
for (auto *I : InternalInstructions) {
for (auto *User : I->users()) {
if (!AllInstructions.contains(cast<Instruction>(User)))
return false;
}
}
return true;
}		}

llvm/test/CodeGen/AArch64/complex-deinterleaving-multiuses.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s --mattr=+complxnum,+neon,+fullfp16 -o - | FileCheck %s

				target triple = "aarch64-arm-none-eabi"
				; Expected to transform
				NickGuy: Why is it not? The presence of `fcmla` in the expected output tells me that it is expected to transform.
				igor.kirillov (author): Yes! Forgot to update after rebasing on top of the pre-commit.
				; *p = (a * b);
				; return (a * b) * a
				define <4 x float> @mul_triangle(<4 x float> %a, <4 x float> %b, ptr %p) {
				; CHECK-LABEL: mul_triangle:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: movi v3.2d, #0000000000000000
				; CHECK-NEXT: movi v2.2d, #0000000000000000
				; CHECK-NEXT: fcmla v3.4s, v1.4s, v0.4s, #0
				; CHECK-NEXT: fcmla v3.4s, v1.4s, v0.4s, #90
				; CHECK-NEXT: fcmla v2.4s, v0.4s, v3.4s, #0
				; CHECK-NEXT: str q3, [x0]
				; CHECK-NEXT: fcmla v2.4s, v0.4s, v3.4s, #90
				; CHECK-NEXT: mov v0.16b, v2.16b
				; CHECK-NEXT: ret
				entry:
				%strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec35 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%strided.vec37 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec38 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x float> %strided.vec37, %strided.vec
				%1 = fmul fast <2 x float> %strided.vec38, %strided.vec35
				%2 = fsub fast <2 x float> %0, %1
				%3 = fmul fast <2 x float> %2, %strided.vec35
				%4 = fmul fast <2 x float> %strided.vec38, %strided.vec
				%5 = fmul fast <2 x float> %strided.vec35, %strided.vec37
				%6 = fadd fast <2 x float> %4, %5
				%otheruse = shufflevector <2 x float> %2, <2 x float> %6, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				store <4 x float> %otheruse, ptr %p
				%7 = fmul fast <2 x float> %6, %strided.vec
				%8 = fadd fast <2 x float> %3, %7
				%9 = fmul fast <2 x float> %2, %strided.vec
				%10 = fmul fast <2 x float> %6, %strided.vec35
				%11 = fsub fast <2 x float> %9, %10
				%interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x float> %interleaved.vec
				}

				; Expected to not transform
				; *p = (a * b).real();
				; return (a * b) * a
				define <4 x float> @mul_triangle_external_use(<4 x float> %a, <4 x float> %b, ptr %p) {
				; CHECK-LABEL: mul_triangle_external_use:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: ext v2.16b, v0.16b, v0.16b, #8
				; CHECK-NEXT: ext v3.16b, v1.16b, v1.16b, #8
				; CHECK-NEXT: zip2 v4.2s, v0.2s, v2.2s
				; CHECK-NEXT: zip1 v0.2s, v0.2s, v2.2s
				; CHECK-NEXT: zip1 v5.2s, v1.2s, v3.2s
				; CHECK-NEXT: zip2 v1.2s, v1.2s, v3.2s
				; CHECK-NEXT: fmul v2.2s, v4.2s, v5.2s
				; CHECK-NEXT: fmul v3.2s, v1.2s, v4.2s
				; CHECK-NEXT: fmla v2.2s, v0.2s, v1.2s
				; CHECK-NEXT: fneg v1.2s, v3.2s
				; CHECK-NEXT: fmul v3.2s, v2.2s, v4.2s
				; CHECK-NEXT: fmla v1.2s, v0.2s, v5.2s
				; CHECK-NEXT: fmul v5.2s, v2.2s, v0.2s
				; CHECK-NEXT: str d2, [x0]
				; CHECK-NEXT: fneg v3.2s, v3.2s
				; CHECK-NEXT: fmla v5.2s, v4.2s, v1.2s
				; CHECK-NEXT: fmla v3.2s, v0.2s, v1.2s
				; CHECK-NEXT: zip1 v0.4s, v3.4s, v5.4s
				; CHECK-NEXT: ret
				entry:
				%strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec35 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%strided.vec37 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec38 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x float> %strided.vec37, %strided.vec
				%1 = fmul fast <2 x float> %strided.vec38, %strided.vec35
				%2 = fsub fast <2 x float> %0, %1
				%3 = fmul fast <2 x float> %2, %strided.vec35
				%4 = fmul fast <2 x float> %strided.vec38, %strided.vec
				%5 = fmul fast <2 x float> %strided.vec35, %strided.vec37
				%6 = fadd fast <2 x float> %4, %5
				store <2 x float> %6, ptr %p
				%7 = fmul fast <2 x float> %6, %strided.vec
				%8 = fadd fast <2 x float> %3, %7
				%9 = fmul fast <2 x float> %2, %strided.vec
				%10 = fmul fast <2 x float> %6, %strided.vec35
				%11 = fsub fast <2 x float> %9, %10
				%interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x float> %interleaved.vec
				}

				; Expected to not transform. Shows that external use prevents deinterleaving whole chain.
				; *p1 = (a * b).real();
				; *p2 = (a * b) * c;
				; return d * c
				define <4 x float> @monster(<4 x float> %a, <4 x float> %b, <4 x float> %c, <4 x float> %d, ptr %p1, ptr %p2) {
				dmgreen: It is not obvious to me why d * c should prevent the rest of the tree transforming. Can you precommit the tests to show the differences in the review?
				NickGuy: I'm not sure it's the `d * c` that's preventing it, so much as the store to %p1 (it doesn't reinterleave, so the store is deemed "outside the graph"). Though I agree, pre-commits would be useful to see the difference in the other cases.
				igor.kirillov (author): The idea is that any external use of a complex computation chain prevents deinterleaving. And if one chain stops being deinterleaved, it might affect another chain, which is shown in this test: p1 prevents a * b from being deinterleaved, which in turn prevents (a * b * c) from being deinterleaved, and that in turn spoils the d * c expression. As a result, nothing is deinterleaved. At least that's what I thought before pre-committing these tests, where the pre-patch code decided to deinterleave the d * c operation - https://reviews.llvm.org/D147659. Now, I have a question. If shufflevector is used outside, should we still deinterleave the code?
				dmgreen: Sorry, I apparently forgot to send this as a reply to @igor.kirillov. Yeah, that was my point, but it will depend on whether the shuffles get folded into a ld2 or need to be replicated. On its own, replacing fadd/fmul(shuffle, shuffle) with fcmla will be profitable even if the shuffles have other uses. But if the shuffles could be combined into a ld2, it isn't as obvious. One of the earlier versions of this pass had a cost model that tried to work out when it was profitable, but I suggested removing it to simplify the patch, as it wasn't very useful at the time. I would guess that adding multiple disparate subgraphs into the same graph makes that more difficult to do consistently, but in this case we can perhaps just check the shuffle's operand to see if it is a load. It is probably worth having tests for both, and giving them better names.
				igor.kirillov (author): @dmgreen I am not sure I understood correctly which tests to add. But anyway, I readjusted this one and added a new one that shows how an external use might infect the operation graph - https://reviews.llvm.org/D147659
				NickGuy:
				> If shufflevector is used outside, should we still deinterleave code?
				As long as the shufflevector used is the reinterleaving one, we're fine to do so, as it should simply replace only what it can and preserve the rest, but that feels very much like a cost model problem to me.
				> The idea is that any external use of a complex computation chain prevents deinterleaving. And if one chain stops being deinterleaved, it might affect another chain, which is shown in this test: p1 prevents a * b from being deinterleaved, which in turn prevents (a * b * c) from being deinterleaved, and that in turn spoils the d * c expression. As a result, nothing is deinterleaved.
				In this case though, I don't see a reason why `d * c` shouldn't be replaced; it's not part of any other expression, so it should be able to be replaced in isolation. For that matter, the assignment to p1 is the only expression that I'd expect not to be transformed; the rest should be able to go through fine (it would be a different story if it was something like `p2 = (a * b) * p1`, but the lack of data dependencies between them means they can and should be treated in isolation from each other).
				; CHECK-LABEL: monster:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: ext v5.16b, v0.16b, v0.16b, #8
				; CHECK-NEXT: ext v6.16b, v1.16b, v1.16b, #8
				; CHECK-NEXT: ext v4.16b, v3.16b, v3.16b, #8
				; CHECK-NEXT: ext v7.16b, v2.16b, v2.16b, #8
				; CHECK-NEXT: zip2 v16.2s, v0.2s, v5.2s
				; CHECK-NEXT: zip1 v17.2s, v1.2s, v6.2s
				; CHECK-NEXT: zip2 v1.2s, v1.2s, v6.2s
				; CHECK-NEXT: zip1 v0.2s, v0.2s, v5.2s
				; CHECK-NEXT: zip2 v18.2s, v3.2s, v4.2s
				; CHECK-NEXT: zip2 v6.2s, v2.2s, v7.2s
				; CHECK-NEXT: zip1 v2.2s, v2.2s, v7.2s
				; CHECK-NEXT: zip1 v3.2s, v3.2s, v4.2s
				; CHECK-NEXT: fmul v5.2s, v17.2s, v16.2s
				; CHECK-NEXT: fmul v16.2s, v1.2s, v16.2s
				; CHECK-NEXT: fmul v4.2s, v18.2s, v6.2s
				; CHECK-NEXT: fmul v7.2s, v3.2s, v6.2s
				; CHECK-NEXT: fmla v5.2s, v0.2s, v1.2s
				; CHECK-NEXT: fneg v1.2s, v16.2s
				; CHECK-NEXT: fneg v4.2s, v4.2s
				; CHECK-NEXT: fmla v7.2s, v2.2s, v18.2s
				; CHECK-NEXT: fmla v1.2s, v0.2s, v17.2s
				; CHECK-NEXT: fmul v17.2s, v2.2s, v5.2s
				; CHECK-NEXT: fmul v0.2s, v6.2s, v5.2s
				; CHECK-NEXT: fmla v4.2s, v2.2s, v3.2s
				; CHECK-NEXT: fmla v17.2s, v1.2s, v6.2s
				; CHECK-NEXT: str d1, [x0]
				; CHECK-NEXT: fneg v16.2s, v0.2s
				; CHECK-NEXT: zip1 v0.4s, v4.4s, v7.4s
				; CHECK-NEXT: fmla v16.2s, v1.2s, v2.2s
				; CHECK-NEXT: st2 { v16.2s, v17.2s }, [x1]
				; CHECK-NEXT: ret
				entry:
				%strided.vec = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec88 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%strided.vec90 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec91 = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%0 = fmul fast <2 x float> %strided.vec91, %strided.vec
				%1 = fmul fast <2 x float> %strided.vec90, %strided.vec88
				%2 = fadd fast <2 x float> %0, %1
				%3 = fmul fast <2 x float> %strided.vec90, %strided.vec
				%4 = fmul fast <2 x float> %strided.vec91, %strided.vec88
				%5 = fsub fast <2 x float> %3, %4
				store <2 x float> %5, ptr %p1
				%strided.vec93 = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec94 = shufflevector <4 x float> %c, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%6 = fmul fast <2 x float> %strided.vec94, %5
				%7 = fmul fast <2 x float> %strided.vec93, %2
				%8 = fadd fast <2 x float> %6, %7
				%9 = fmul fast <2 x float> %strided.vec93, %5
				%10 = fmul fast <2 x float> %strided.vec94, %2
				%11 = fsub fast <2 x float> %9, %10
				%interleaved.vec = shufflevector <2 x float> %11, <2 x float> %8, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				store <4 x float> %interleaved.vec, ptr %p2
				%strided.vec96 = shufflevector <4 x float> %d, <4 x float> poison, <2 x i32> <i32 0, i32 2>
				%strided.vec97 = shufflevector <4 x float> %d, <4 x float> poison, <2 x i32> <i32 1, i32 3>
				%12 = fmul fast <2 x float> %strided.vec96, %strided.vec94
				%13 = fmul fast <2 x float> %strided.vec97, %strided.vec93
				%14 = fadd fast <2 x float> %13, %12
				%15 = fmul fast <2 x float> %strided.vec96, %strided.vec93
				%16 = fmul fast <2 x float> %strided.vec97, %strided.vec94
				%17 = fsub fast <2 x float> %15, %16
				%interleaved.vec98 = shufflevector <2 x float> %17, <2 x float> %14, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
				ret <4 x float> %interleaved.vec98
				}
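
The inline discussion on `@monster` contrasts two ways an external use could spread through the operation graph. The sketch below models both on a toy graph; it is an illustration of the two readings, not the algorithm the pass implements. `ToyNode`, `ToyGraph`, `rejectedWhenInterconnected`, `rejectedByDataDependence` and `buildMonster` are hypothetical names introduced here. The first policy assumes that an external use rejects everything interconnected with it, including expressions that merely share an operand. The second assumes that only expressions that read a rejected value are rejected.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model: one node per complex-valued expression.
struct ToyNode {
  std::vector<std::string> Operands; // names of nodes or inputs it reads
  bool HasExternalUse;               // e.g. only the real half is stored
};

using ToyGraph = std::map<std::string, ToyNode>;

bool reads(const ToyNode &N, const std::string &Name) {
  return std::find(N.Operands.begin(), N.Operands.end(), Name) !=
         N.Operands.end();
}

// Two nodes are interconnected if one reads the other or they share an operand.
bool interconnected(const ToyGraph &G, const std::string &X,
                    const std::string &Y) {
  const ToyNode &NX = G.at(X);
  const ToyNode &NY = G.at(Y);
  if (reads(NX, Y) || reads(NY, X))
    return true;
  for (const std::string &Op : NX.Operands)
    if (reads(NY, Op))
      return true;
  return false;
}

// Start from nodes with an external use and grow the rejected set until
// Spreads(From, To) no longer adds anything.
template <typename Pred>
std::set<std::string> propagate(const ToyGraph &G, Pred Spreads) {
  std::set<std::string> Rejected;
  for (const auto &Entry : G)
    if (Entry.second.HasExternalUse)
      Rejected.insert(Entry.first);
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &Entry : G) {
      if (Rejected.count(Entry.first))
        continue;
      for (const std::string &R : Rejected) {
        if (Spreads(R, Entry.first)) {
          Rejected.insert(Entry.first);
          Changed = true;
          break;
        }
      }
    }
  }
  return Rejected;
}

// Policy 1 (assumption): rejection spreads to anything interconnected.
std::set<std::string> rejectedWhenInterconnected(const ToyGraph &G) {
  return propagate(G, [&](const std::string &From, const std::string &To) {
    return interconnected(G, From, To);
  });
}

// Policy 2 (assumption): rejection spreads only to nodes that read a
// rejected node.
std::set<std::string> rejectedByDataDependence(const ToyGraph &G) {
  return propagate(G, [&](const std::string &From, const std::string &To) {
    return reads(G.at(To), From);
  });
}

// Toy version of @monster: ab has an external use, abc reads ab, and dc
// shares the input c with abc.
ToyGraph buildMonster() {
  ToyGraph G;
  G["ab"] = ToyNode{{"a", "b"}, true};
  G["abc"] = ToyNode{{"ab", "c"}, false};
  G["dc"] = ToyNode{{"d", "c"}, false};
  return G;
}
```

On the graph from `buildMonster()`, the first policy rejects `ab`, `abc` and `dc`, which corresponds to the "nothing is deinterleaved" outcome described by the author. The second policy leaves `dc` eligible, which is closer to the reviewers' point that `d * c` has no data dependency on the stored value.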