This is functionally equivalent to the old implementation.
As per https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=4739e6e4eb54d3736e6457249c0919b30f6c855a&stat=instructions
this is a clear compile-time win, with no regressions and an overall geomean improvement of -0.08%.
32 PHIs appears to be the sweet spot; both 16 and 64 performed worse:
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=c4efe1fbbfdf0305ac26cd19eacb0c7774cdf60e&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=5f4e9bf6416e45eba483a4e5e263749989fdb3b3&to=e4989d1c67010d3339d1a40ff5286a31f10cfe82&stat=instructions
If we have more PHIs than that, we fall back to the original DenseSet-based implementation,
so the not-so-fast cases will still be handled.
However, compile time isn't the main motivation here.
I can name at least 3 limitations of this CSE:
- Assumes that all PHI nodes list their incoming basic blocks in the same order (this can be fixed while keeping the DenseMap)
- Does not special-case undef incoming values (I don't see how we could do this with hashing)
- Does not special-case backedge incoming values (this could perhaps be fixed by hashing the backedge as some sentinel value)
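For illustration, here is a minimal standalone sketch of the hash-based tier, which also demonstrates the first limitation (order sensitivity). `Phi`, `hashPhi`, and `csePhis` are hypothetical names for this sketch, not identifiers from the actual patch:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PHI node: a list of incoming
// (basic block, value) pairs, identified here by plain strings.
struct Phi {
  std::vector<std::pair<std::string, std::string>> Incoming;
};

// Hash a PHI by folding the hashes of its incoming pairs *in order*.
// This mirrors the order-sensitivity limitation: PHIs that are
// semantically identical but list their predecessors in a different
// order hash differently, so the CSE misses them.
static std::size_t hashPhi(const Phi &P) {
  std::size_t H = 0;
  std::hash<std::string> HS;
  for (const auto &[BB, V] : P.Incoming)
    H = H * 31 + (HS(BB) * 131 + HS(V));
  return H;
}

// Hash-based CSE: map hash -> index of the first PHI seen with that
// hash; later duplicates are redirected to it. (A real implementation
// must also compare the PHIs to guard against hash collisions;
// elided here for brevity.)
std::vector<int> csePhis(const std::vector<Phi> &Phis) {
  std::unordered_map<std::size_t, int> Seen;
  std::vector<int> ReplacedBy(Phis.size());
  for (int I = 0; I != (int)Phis.size(); ++I) {
    auto [It, Inserted] = Seen.try_emplace(hashPhi(Phis[I]), I);
    ReplacedBy[I] = It->second; // self if unique, earlier PHI if duplicate
  }
  return ReplacedBy;
}
```

An exact duplicate is deduplicated, while the same incoming pairs in a different predecessor order are not, which is exactly the first limitation above.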
Thoughts?