This is an archive of the discontinued LLVM Phabricator instance.

Differential D87408

[NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32 PHI's, O(n^2) algo is faster (geomean -0.08%)
ClosedPublic

Authored by lebedev.ri on Sep 9 2020, 12:11 PM.

Details

Reviewers

fhahn
nikic
spatel
efriedma
reames

Commits

rGaadf55d1cea2: [NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32…

Summary

This is functionally equivalent to the old implementation.

As per https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=4739e6e4eb54d3736e6457249c0919b30f6c855a&stat=instructions
this is a clear compile-time win with no geomean regressions and an overall geomean of -0.08%.

32 PHIs appears to be the sweet spot; thresholds of 16 and 64 both performed worse:
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=c4efe1fbbfdf0305ac26cd19eacb0c7774cdf60e&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=e4989d1c67010d3339d1a40ff5286a31f10cfe82&stat=instructions

If there are more PHIs than that, we fall back to the original DenseSet-based implementation,
so the cases where the quadratic search is not so fast are still handled.
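
The size-based dispatch described above can be sketched on a toy model. This is only an illustration, not the LLVM code: plain `int` keys stand in for PHI nodes, `std::unordered_set` stands in for `DenseSet`, and every name (`SmallSizeThreshold`, `dedupeNaive`, `dedupeSetBased`, `dedupe`, `checkEquivalent`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the patch's threshold option (default 32).
constexpr std::size_t SmallSizeThreshold = 32;

// O(n^2) pairwise scan: no hashing and no table to maintain.
std::vector<int> dedupeNaive(const std::vector<int> &In) {
  std::vector<int> Out;
  for (int V : In) {
    bool Seen = false;
    for (int U : Out)
      if (U == V) {
        Seen = true;
        break;
      }
    if (!Seen)
      Out.push_back(V);
  }
  return Out;
}

// Set-driven scan: better asymptotics, but pays for hashing on every element.
std::vector<int> dedupeSetBased(const std::vector<int> &In) {
  std::unordered_set<int> Seen;
  std::vector<int> Out;
  for (int V : In)
    if (Seen.insert(V).second)
      Out.push_back(V);
  return Out;
}

// Small inputs take the exhaustive path; larger ones fall back to the set.
std::vector<int> dedupe(const std::vector<int> &In) {
  if (In.size() <= SmallSizeThreshold)
    return dedupeNaive(In);
  return dedupeSetBased(In);
}

// The change is meant to be NFC, so both paths must agree on any input.
void checkEquivalent(const std::vector<int> &In) {
  assert(dedupeNaive(In) == dedupeSetBased(In));
}
```

Both paths keep the first occurrence of each key, which is what makes the choice between them invisible to callers in this model.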

However, compile time isn't the main motivation here.
I can name at least 3 limitations of this CSE:

1. It assumes that all PHI nodes have incoming basic blocks in the same order (can be fixed while keeping the DenseMap).
2. It does not special-handle undef incoming values (I don't see how we can do this with hashing).
3. It does not special-handle backedge incoming values (maybe can be fixed by hashing a backedge as some magical value).
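
To make the first limitation concrete, here is a toy sketch, assuming a PHI is modeled as a list of (incoming block id, incoming value id) pairs. The types and helpers (`Incoming`, `ToyPhi`, `identicalSameOrder`, `identicalAnyOrder`, `example`) are hypothetical; this is not what the patch implements, only an illustration of what an order-insensitive comparison could look like.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model: one (block id, value id) pair per incoming edge.
using Incoming = std::pair<int, int>;
using ToyPhi = std::vector<Incoming>;

// Position-wise comparison: relies on both PHIs listing blocks in one order.
bool identicalSameOrder(const ToyPhi &A, const ToyPhi &B) { return A == B; }

// Order-insensitive comparison: sort copies by (block, value) before comparing.
bool identicalAnyOrder(ToyPhi A, ToyPhi B) {
  std::sort(A.begin(), A.end());
  std::sort(B.begin(), B.end());
  return A == B;
}

// The same two edges, listed in a different order.
void example() {
  ToyPhi A = {{0, 10}, {1, 11}};
  ToyPhi B = {{1, 11}, {0, 10}};
  assert(!identicalSameOrder(A, B)); // Missed by the position-wise check.
  assert(identicalAnyOrder(A, B));   // Found once the order is canonical.
}
```

A hash-based variant would need the same canonical order when hashing, so that equal PHIs still hash equally.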

Thoughts?

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri created this revision. Sep 9 2020, 12:11 PM

Herald added subscribers: dexonsmith, hiraditya. Sep 9 2020, 12:11 PM

lebedev.ri requested review of this revision. Sep 9 2020, 12:11 PM

Drop some more dead code.

lebedev.ri edited the summary of this revision. Sep 9 2020, 12:18 PM

lebedev.ri edited the summary of this revision. Sep 9 2020, 12:29 PM

Harbormaster completed remote builds in B71131: Diff 290793. Sep 9 2020, 1:02 PM

Harbormaster completed remote builds in B71129: Diff 290789.

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before. It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)
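
The suggested cap on the inner loop can be sketched as follows. This is a simplified model under stated assumptions: `int` keys stand in for PHI nodes, the keys are independent (so no restart after a removal is needed), and `std::list` is used so that erasing is O(1). `dedupeCapped` and `MaxInner` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

// Removes later duplicates of each element, but inspects at most MaxInner
// candidates per element, so the number of comparisons is at most
// n * MaxInner. Duplicates further away than the cap are left in place.
// Returns the number of elements removed.
std::size_t dedupeCapped(std::list<int> &Keys, std::size_t MaxInner) {
  assert(MaxInner > 0 && "a zero cap would never find anything");
  std::size_t Removed = 0;
  for (auto I = Keys.begin(); I != Keys.end(); ++I) {
    std::size_t Inspected = 0;
    for (auto J = std::next(I); J != Keys.end() && Inspected < MaxInner;
         ++Inspected) {
      if (*J == *I) {
        J = Keys.erase(J); // J now points at the next candidate.
        ++Removed;
      } else {
        ++J;
      }
    }
  }
  return Removed;
}
```

With `{1, 2, 3, 1}` and a cap of 100 the trailing `1` is removed; with a cap of 2 the scan stops before reaching it, which is the "may fail to CSE" trade-off mentioned above.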

In D87408#2264270, @nikic wrote:

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before.

Yes.

It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)

Before doing that, there is one more thing I guess I should try first: it doesn't make sense to reprocess *all* phis;
we should partition them by type, and only reprocess the group in which we CSE'd.
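
A rough sketch of the partitioning idea, assuming a toy model in which a PHI is just a type id plus an opaque key standing in for its operand list (`ToyPhi`, `partitionByType`, and `countDuplicates` are hypothetical names). One plausible rationale is that the incoming values of a phi have the phi's own type, so PHIs that use a replaced PHI would sit in the same group.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model: equal (Type, Key) means "identical PHI".
struct ToyPhi {
  int Type;
  int Key;
};

// Buckets PHI indices by type. Duplicates can only occur within one bucket.
std::map<int, std::vector<std::size_t>>
partitionByType(const std::vector<ToyPhi> &Phis) {
  std::map<int, std::vector<std::size_t>> Groups;
  for (std::size_t I = 0; I < Phis.size(); ++I)
    Groups[Phis[I].Type].push_back(I);
  return Groups;
}

// Counts PHIs that have a later duplicate, comparing only within a group.
// In the real transform, a restart after a successful CSE would then be
// confined to the group being scanned rather than the whole block.
std::size_t countDuplicates(const std::vector<ToyPhi> &Phis) {
  std::size_t Dups = 0;
  for (const auto &Group : partitionByType(Phis)) {
    const std::vector<std::size_t> &Idx = Group.second;
    assert(!Idx.empty() && "groups are only created when populated");
    for (std::size_t A = 0; A < Idx.size(); ++A)
      for (std::size_t B = A + 1; B < Idx.size(); ++B)
        if (Phis[Idx[A]].Key == Phis[Idx[B]].Key) {
          ++Dups;
          break;
        }
  }
  return Dups;
}
```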

Will update tomorrow once http://llvm-compile-time-tracker.com/ processes variants.

lebedev.ri edited the summary of this revision. Sep 9 2020, 11:48 PM

In D87408#2264270, @nikic wrote:

The problem here is that this is O(n^2) in the number of phi nodes, rather than O(n log n). So this will be faster for average inputs, but potentially much slower for degenerate cases. That said, I can't say that I've encountered "block with ten thousand phi nodes" as a problem before. It might still make sense to limit this heuristically, e.g. by limiting the inner loop to at most 100 iterations (which may fail to CSE some phi nodes in degenerate cases, but will avoid quadratic blowup.)

OK, I feel dumber than usual now.
I've accidentally pushed more than the commit in question, so those numbers included some unrelated changes.
The real numbers are: https://llvm-compile-time-tracker.com/compare.php?from=25f3cc0ced1759af1911c2446ac40fab4f5e5571&to=5f12d8b73ac75a487c454f4a197f6bb69305bfed&stat=instructions
And it's a much clearer win than what we've looked at https://llvm-compile-time-tracker.com/compare.php?from=25f3cc0ced1759af1911c2446ac40fab4f5e5571&to=b61307f6f13cb1020d0940c2d3103a0a2493b2ce&stat=instructions
ReleaseLTO-g is now also a -0.08 improvement.
The only compile-time regressions are:

-O3: tramp3d-v4 118935M 118946M (+0.01%)
ReleaseLTO-g (link only): kimwitu++ 38596M 38620M (+0.06%)
ReleaseLTO-g (link only): consumer-typeset 41942M 41949M (+0.02%)
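
The percentages in this list follow from the two counts on each line. A small sketch, assuming the first number is the count before the change and the second the count after (`percentChange` is a hypothetical helper):

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent between two instruction counts.
double percentChange(double Before, double After) {
  assert(Before > 0 && "baseline count must be positive");
  return (After - Before) / Before * 100.0;
}
```

For tramp3d-v4 this gives (118946 - 118935) / 118935, roughly 0.009%, which rounds to the +0.01% shown; the kimwitu++ and consumer-typeset lines work out to roughly 0.062% and 0.017%.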

In light of that, do we still want cut-offs?

lebedev.ri requested review of this revision. Sep 9 2020, 11:52 PM

The point of the cutoff wouldn't be to handle anything in the LLVM testsuite; it would be to prevent it from blowing up on someone's giant machine-generated function which isn't represented in the testsuite.

lebedev.ri retitled this revision from [NFC] EliminateDuplicatePHINodes(): drop DenseMap-driven CSE in favor of quadratic algorithmn to [NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32 PHI's, O(n^2) algo is faster (geomean -0.08%). Sep 13 2020, 2:09 PM

lebedev.ri edited the summary of this revision.

This comment was removed by lebedev.ri.

In D87408#2269206, @efriedma wrote:

The point of the cutoff wouldn't be to handle anything in the LLVM testsuite; it would be to prevent it from blowing up on someone's giant machine-generated function which isn't represented in the testsuite.

Done, thanks, PTAL.

Harbormaster completed remote builds in B71516: Diff 291473. Sep 13 2020, 2:50 PM

LGTM

This revision is now accepted and ready to land. Sep 16 2020, 7:19 PM

In D87408#2278386, @efriedma wrote:

LGTM

Cool, thank you for the review!

This revision was landed with ongoing or failed builds. Sep 17 2020, 1:29 AM

Closed by commit rGaadf55d1cea2: [NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32… (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri added a commit: rGaadf55d1cea2: [NFC] EliminateDuplicatePHINodes(): small-size optimization: if there are <= 32….

Revision Contents

Path: llvm/lib/Transforms/Utils/Local.cpp
Size: 55 lines

Diff 292425

llvm/lib/Transforms/Utils/Local.cpp

[... 98 lines not shown ...]
 #ifdef EXPENSIVE_CHECKS
     cl::init(true),
 #else
     cl::init(false),
 #endif
     cl::Hidden,
     cl::desc("Perform extra assertion checking to verify that PHINodes's hash "
              "function is well-behaved w.r.t. its isEqual predicate"));

+static cl::opt<unsigned> PHICSENumPHISmallSize(
+    "phicse-num-phi-smallsize", cl::init(32), cl::Hidden,
+    cl::desc(
+        "When the basic block contains not more than this number of PHI nodes, "
+        "perform a (faster!) exhaustive search instead of set-driven one."));
+
 // Max recursion depth for collectBitParts used when detecting bswap and
 // bitreverse idioms
 static const unsigned BitPartRecursionMaxDepth = 64;

 //===----------------------------------------------------------------------===//
 // Local constant propagation.
 //

[... 1,012 lines not shown ...]
   if (DTU) {
     DTU->applyUpdatesPermissive(Updates);
     DTU->deleteBB(BB);
   } else {
     BB->eraseFromParent(); // Delete the old basic block.
   }
   return true;
 }

-// WARNING: this logic must be kept in sync with
-// Instruction::isIdenticalToWhenDefined()!
-bool llvm::EliminateDuplicatePHINodes(BasicBlock *BB) {
+static bool EliminateDuplicatePHINodesNaiveImpl(BasicBlock *BB) {
+  // This implementation doesn't currently consider undef operands
+  // specially. Theoretically, two phis which are identical except for
+  // one having an undef where the other doesn't could be collapsed.
+
+  bool Changed = false;
+
+  // Examine each PHI.
+  // Note that increment of I must NOT be in the iteration_expression, since
+  // we don't want to immediately advance when we restart from the beginning.
+  for (auto I = BB->begin(); PHINode *PN = dyn_cast<PHINode>(I);) {
+    ++I;
+    // Is there an identical PHI node in this basic block?
+    // Note that we only look in the upper square's triangle,
+    // we already checked that the lower triangle PHI's aren't identical.
+    for (auto J = I; PHINode *DuplicatePN = dyn_cast<PHINode>(J); ++J) {
+      if (!DuplicatePN->isIdenticalToWhenDefined(PN))
+        continue;
+      // A duplicate. Replace this PHI with the base PHI.
+      ++NumPHICSEs;
+      DuplicatePN->replaceAllUsesWith(PN);
+      DuplicatePN->eraseFromParent();
+      Changed = true;
+
+      // The RAUW can change PHIs that we already visited.
+      I = BB->begin();
+      break; // Start over from the beginning.
+    }
+  }
+  return Changed;
+}
+
+static bool EliminateDuplicatePHINodesSetBasedImpl(BasicBlock *BB) {
   // This implementation doesn't currently consider undef operands
   // specially. Theoretically, two phis which are identical except for
   // one having an undef where the other doesn't could be collapsed.

   struct PHIDenseMapInfo {
     static PHINode *getEmptyKey() {
       return DenseMapInfo<PHINode *>::getEmptyKey();
     }

     static PHINode *getTombstoneKey() {
       return DenseMapInfo<PHINode *>::getTombstoneKey();
     }

     static bool isSentinel(PHINode *PN) {
       return PN == getEmptyKey() || PN == getTombstoneKey();
     }

+    // WARNING: this logic must be kept in sync with
+    // Instruction::isIdenticalToWhenDefined()!
     static unsigned getHashValueImpl(PHINode *PN) {
       // Compute a hash value on the operands. Instcombine will likely have
       // sorted them, which helps expose duplicates, but we have to check all
       // the operands to be safe in case instcombine hasn't run.
       return static_cast<unsigned>(hash_combine(
           hash_combine_range(PN->value_op_begin(), PN->value_op_end()),
           hash_combine_range(PN->block_begin(), PN->block_end())));
     }
[... 23 lines not shown; enclosing context: static bool isEqual(PHINode *LHS, PHINode *RHS) { ...]
       assert(!Result || (isSentinel(LHS) && LHS == RHS) ||
              getHashValueImpl(LHS) == getHashValueImpl(RHS));
       return Result;
     }
   };

   // Set of unique PHINodes.
   DenseSet<PHINode *, PHIDenseMapInfo> PHISet;
+  PHISet.reserve(4 * PHICSENumPHISmallSize);

   // Examine each PHI.
   bool Changed = false;
   for (auto I = BB->begin(); PHINode *PN = dyn_cast<PHINode>(I++);) {
     auto Inserted = PHISet.insert(PN);
     if (!Inserted.second) {
       // A duplicate. Replace this PHI with its duplicate.
       ++NumPHICSEs;
       PN->replaceAllUsesWith(*Inserted.first);
       PN->eraseFromParent();
       Changed = true;

       // The RAUW can change PHIs that we already visited. Start over from the
       // beginning.
       PHISet.clear();
       I = BB->begin();
     }
   }

   return Changed;
 }

+bool llvm::EliminateDuplicatePHINodes(BasicBlock *BB) {
+  if (
+#ifndef NDEBUG
+      !PHICSEDebugHash &&
+#endif
+      hasNItemsOrLess(BB->phis(), PHICSENumPHISmallSize))
+    return EliminateDuplicatePHINodesNaiveImpl(BB);
+  return EliminateDuplicatePHINodesSetBasedImpl(BB);
+}
+
 /// enforceKnownAlignment - If the specified pointer points to an object that
 /// we control, modify the object's alignment to PrefAlign. This isn't
 /// often possible though. If alignment is important, a more reliable approach
 /// is to simply align all global variables and allocation instructions to
 /// their preferred alignment from the beginning.
 static Align enforceKnownAlignment(Value *V, Align Alignment, Align PrefAlign,
                                    const DataLayout &DL) {
   assert(PrefAlign > Alignment);
[... 1,886 lines not shown ...]
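
Both implementations in the diff restart from the beginning after a replacement, because replacing all uses can make previously visited PHIs identical. A standalone toy model of that effect, not the LLVM code: a PHI is an id plus operand ids, an operand may name another PHI's id, and `ToyPhi`, `replaceAllUses`, and `eliminateDuplicates` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a PHI: a unique id plus operand ids.
struct ToyPhi {
  int Id;
  std::vector<int> Ops;
};

// Rewrites every use of From to To across the block (a stand-in for RAUW).
void replaceAllUses(std::vector<ToyPhi> &Block, int From, int To) {
  assert(From != To && "replacing a value with itself is meaningless");
  for (ToyPhi &P : Block)
    for (int &Op : P.Ops)
      if (Op == From)
        Op = To;
}

// Quadratic elimination with restart. The outer index advances inside the
// body, so resetting it to 0 is not immediately followed by an increment.
bool eliminateDuplicates(std::vector<ToyPhi> &Block) {
  bool Changed = false;
  for (std::size_t I = 0; I < Block.size();) {
    const std::size_t Base = I++;
    for (std::size_t J = I; J < Block.size(); ++J) {
      if (Block[J].Ops != Block[Base].Ops)
        continue;
      const int DupId = Block[J].Id;
      const int BaseId = Block[Base].Id;
      Block.erase(Block.begin() + J);
      replaceAllUses(Block, DupId, BaseId);
      Changed = true;
      I = 0; // Earlier PHIs may have become identical; start over.
      break;
    }
  }
  return Changed;
}
```

For example, take PHIs 0 through 3 with operands `{2, 100}`, `{3, 100}`, `{100, 101}`, and `{100, 101}`. PHIs 0 and 1 differ at first. Merging PHI 3 into PHI 2 rewrites PHI 1 to `{2, 100}`, and only the restart lets the scan notice that it now matches PHI 0.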