This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner] Add up latencies of all instructions in new pattern.
ClosedPublic

Authored by fhahn on Nov 21 2017, 7:53 AM.

Download Raw Diff

Details

Reviewers

Gerolf
sebpop
spop
fhahn

Commits

rG001c3dd202b6: [MachineCombiner] Add up latencies of all instructions in new pattern.
rL319951: [MachineCombiner] Add up latencies of all instructions in new pattern.

Summary

When calculating the RootLatency, we add up all the latencies of the
deleted instructions. But for NewRootLatency we only add the latency of
the new root instructions, ignoring the latencies of the other
instructions inserted. This leads the combiner to underestimate the cost
of patterns which add multiple instructions. This patch fixes that by
summing up the latencies of all new instructions. For NewRootNode, the
more complex getLatency function is used.

Note that we may be slightly more precise than just summing up
all latencies. For example, consider a pattern like

r1 = INS1 ..
r2 = INS2 ..
r3 = INS3 r1, r2

I think in some other places, the total latency of the pattern would be
estimated as lat(INS3) + max(lat(INS1), lat(INS2)). If you consider
that worth changing, I think it would be best to do in a follow-up
patch.

Diff Detail

Event Timeline

I ran into this being a problem in D40306. The patterns added there are are not profitable on Cortex-A57, but because the NewRootLatency is under-estimated, the unprofitable pattern gets applied.

LGTM

Thanks Evandro for having a look! I would appreciate any feedback from Gerolf or Sebastian, as they reviewed/added the change summing up the latencies of the deleted instructions.

ping

I think your commit makes accounting for both cases - the instructions inserted and the instructions deleted - consistent. From that angle this look OK. I have also some food for thought: in both cases the code makes the assumption that it inserts a dependent instruction chain. Only then it is correct to simply add instruction latencies. I have not checked that this assumption is correct for all cases. So I suggest to add a comment about this. Assuming also that you don't see any notable perf regression you can go ahead and commit from my perspective: LGTM.

Thanks for working on this and your patience!
Cheers
Gerolf

Thanks for having a look. There are no perf regressions on Cortex-A57 on SPEC2k, the lnt test suite and SPEC2006. I'll commit this now, but will also run SPEC2017 on another board.

I've added a comment about the dependency chains. I tried to cover this issue in the original review description, but the dependent instruction chains wording is more concise.

Accept in Phabricator after LGTM by Gerolf and Evandro, to make Arc happy.

This revision is now accepted and ready to land.Dec 6 2017, 9:27 AM

fhahn closed this revision.Dec 6 2017, 12:28 PM

Revision Contents

Path

Size

lib/

CodeGen/

MachineCombiner.cpp

11 lines

Diff 125750

lib/CodeGen/MachineCombiner.cpp

Show First 20 Lines • Show All 276 Lines • ▼ Show 20 Lines	bool MachineCombiner::improvesCriticalPathLen(
// machine trace metrics and CPU models.		// machine trace metrics and CPU models.
if (getCombinerObjective(Pattern) == CombinerObjective::MustReduceDepth)		if (getCombinerObjective(Pattern) == CombinerObjective::MustReduceDepth)
return NewRootDepth < RootDepth;		return NewRootDepth < RootDepth;

// A more flexible cost calculation for the critical path includes the slack		// A more flexible cost calculation for the critical path includes the slack
// of the original code sequence. This may allow the transform to proceed		// of the original code sequence. This may allow the transform to proceed
// even if the instruction depths (data dependency cycles) become worse.		// even if the instruction depths (data dependency cycles) become worse.

unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);		// Account for the latency of the inserted and deleted instructions by
unsigned RootLatency = 0;		// adding up their latencies. This assumes that the inserted and deleted
		// instructions are dependent instruction chains, which might not hold
		// in all cases.
		unsigned NewRootLatency = 0;
		for (unsigned i = 0; i < InsInstrs.size() - 1; i++)
		NewRootLatency += TSchedModel.computeInstrLatency(InsInstrs[i]);
		NewRootLatency += getLatency(Root, NewRoot, BlockTrace);

		unsigned RootLatency = 0;
for (auto I : DelInstrs)		for (auto I : DelInstrs)
RootLatency += TSchedModel.computeInstrLatency(I);		RootLatency += TSchedModel.computeInstrLatency(I);

unsigned RootSlack = BlockTrace.getInstrSlack(*Root);		unsigned RootSlack = BlockTrace.getInstrSlack(*Root);
unsigned NewCycleCount = NewRootDepth + NewRootLatency;		unsigned NewCycleCount = NewRootDepth + NewRootLatency;
unsigned OldCycleCount = RootDepth + RootLatency +		unsigned OldCycleCount = RootDepth + RootLatency +
(SlackIsAccurate ? RootSlack : 0);		(SlackIsAccurate ? RootSlack : 0);
DEBUG(dbgs() << " NewRootLatency: " << NewRootLatency << "\n";		DEBUG(dbgs() << " NewRootLatency: " << NewRootLatency << "\n";
▲ Show 20 Lines • Show All 265 Lines • Show Last 20 Lines