This is an archive of the discontinued LLVM Phabricator instance.

[MachinePipeliner] Refine the RecMII calculation
ClosedPublic

Authored by lsaba on Mar 10 2020, 6:21 AM.

Download Raw Diff

Details

Reviewers

bcahoon
kparzysz
jsji

Commits

rG5c7bbe3659a0: [MachinePipeliner] Refine the RecMII calculation

Summary

In the case of more than one SDep between two successor SUnits in the Nodeset, the current implementation sums the latencies of the dependencies, which could create a larger RecMII than necessary.

for example, in case there is both a data dependency and an output dependency (with latency > 0) between successor nodes:
SU(1) inst1:

successors:
  SU(2): out  latency = 1
  SU(2): data latency = 1

SU(2) inst2:

successors:
  SU(3): out  latency = 1
  SU(3): data latency = 1

SU(3) inst3:

successors:
  SU(1): out  latency = 1
  SU(1): data latency = 1

the NodeSet latency returned would be 6, whereas it could be 3 if we take the max for each successor SUnit.
In general this can be extended to finding the shortest path in the recurrence..
thoughts?

Unfortunately I had a hard time creating a test for this in Hexagon/PowerPC, so help would be appreciated.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lsaba created this revision.Mar 10 2020, 6:21 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2020, 6:21 AM

Herald added a subscriber: steven.zhang. · View Herald Transcript

Harbormaster failed remote builds in B48678: Diff 249349!Mar 10 2020, 6:59 AM

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

In D75918#1916156, @bcahoon wrote:

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

re
Thanks for the fast reply!
we have an implicit physical register that certain instructions define, thus having several instructions in the code would create these output chains, the most similar to this that I found in Hexagon was the usr_ovf register, but I also saw that the Output Deps for these registers get removed in HexagonSubtarget::UsrOverflowMutation::apply.

(In powerPC there's a carry register, but the Output latency is 0)

I think stores to physical registers would also create such output chains?

This calculation could also happen with Data dependencies if there is more than one def,use chain between two instructions (for example multiple output instructions), are those present in Hexagon?

In D75918#1916326, @lsaba wrote:

In D75918#1916156, @bcahoon wrote:

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

re
Thanks for the fast reply!
we have an implicit physical register that certain instructions define, thus having several instructions in the code would create these output chains, the most similar to this that I found in Hexagon was the usr_ovf register, but I also saw that the Output Deps for these registers get removed in HexagonSubtarget::UsrOverflowMutation::apply.

(In powerPC there's a carry register, but the Output latency is 0)

I think stores to physical registers would also create such output chains?

This calculation could also happen with Data dependencies if there is more than one def,use chain between two instructions (for example multiple output instructions), are those present in Hexagon?

Yep, stores to physical registers will create output dependences, which can be done using inline assembly. At least, most of the code in the pipeliner that deals with output dependences is exercised using inline asm code.

The only Hexagon instructions with two outputs that the compiler generates are the post-increment load instructions, and it would be difficult to create a chain of these. There are some esoteric instructions, but they are typically generated through intrinsics only. I can try to create a test for this...

In D75918#1917282, @bcahoon wrote:

Yep, stores to physical registers will create output dependences, which can be done using inline assembly. At least, most of the code in the pipeliner that deals with output dependences is exercised using inline asm code.

The only Hexagon instructions with two outputs that the compiler generates are the post-increment load instructions, and it would be difficult to create a chain of these. There are some esoteric instructions, but they are typically generated through intrinsics only. I can try to create a test for this...

That would be very much appreciated!

ping

Unfortunately, I've had no luck producing a test case that generates this type of graph.

In D75918#1953795, @bcahoon wrote:

Unfortunately, I've had no luck producing a test case that generates this type of graph.

Right, thanks for trying!! how do you suggest we proceed with this patch ?

I've tested this patch with our version of LLVM and it passes our tests. I'm ok with adding the patch as-is, as long as no one else would object.

This revision is now accepted and ready to land.Apr 3 2020, 11:43 AM

In D75918#1960302, @bcahoon wrote:

I've tested this patch with our version of LLVM and it passes our tests. I'm ok with adding the patch as-is, as long as no one else would object.

great, thanks! I'm still having Github commit permission issues so I would appreciate if someone can submit it on behalf of me

ping

I will commit for you.

Closed by commit rG5c7bbe3659a0: [MachinePipeliner] Refine the RecMII calculation (authored by lsaba, committed by jsji). · Explain WhyApr 13 2020, 12:27 PM

This revision was automatically updated to reflect the committed changes.

In D75918#1978445, @jsji wrote:

I will commit for you.

Thank you very much!

Revision Contents

Path

Size

llvm/

include/

llvm/

CodeGen/

MachinePipeliner.h

20 lines

Diff 257064

llvm/include/llvm/CodeGen/MachinePipeliner.h

Show First 20 Lines • Show All 324 Lines • ▼ Show 20 Lines	class NodeSet {
unsigned Latency = 0;		unsigned Latency = 0;

public:		public:
using iterator = SetVector<SUnit *>::const_iterator;		using iterator = SetVector<SUnit *>::const_iterator;

NodeSet() = default;		NodeSet() = default;
NodeSet(iterator S, iterator E) : Nodes(S, E), HasRecurrence(true) {		NodeSet(iterator S, iterator E) : Nodes(S, E), HasRecurrence(true) {
Latency = 0;		Latency = 0;
for (unsigned i = 0, e = Nodes.size(); i < e; ++i)		for (unsigned i = 0, e = Nodes.size(); i < e; ++i) {
for (const SDep &Succ : Nodes[i]->Succs)		DenseMap<SUnit *, unsigned> SuccSUnitLatency;
if (Nodes.count(Succ.getSUnit()))		for (const SDep &Succ : Nodes[i]->Succs) {
Latency += Succ.getLatency();		auto SuccSUnit = Succ.getSUnit();
		if (!Nodes.count(SuccSUnit))
		continue;
		unsigned CurLatency = Succ.getLatency();
		unsigned MaxLatency = 0;
		if (SuccSUnitLatency.count(SuccSUnit))
		MaxLatency = SuccSUnitLatency[SuccSUnit];
		if (CurLatency > MaxLatency)
		SuccSUnitLatency[SuccSUnit] = CurLatency;
		}
		for (auto SUnitLatency : SuccSUnitLatency)
		Latency += SUnitLatency.second;
		}
}		}

bool insert(SUnit *SU) { return Nodes.insert(SU); }		bool insert(SUnit *SU) { return Nodes.insert(SU); }

void insert(iterator S, iterator E) { Nodes.insert(S, E); }		void insert(iterator S, iterator E) { Nodes.insert(S, E); }

template <typename UnaryPredicate> bool remove_if(UnaryPredicate P) {		template <typename UnaryPredicate> bool remove_if(UnaryPredicate P) {
return Nodes.remove_if(P);		return Nodes.remove_if(P);
▲ Show 20 Lines • Show All 248 Lines • Show Last 20 Lines