This is an archive of the discontinued LLVM Phabricator instance.

[MachinePipeliner] Refine the RecMII calculation
ClosedPublic

Authored by lsaba on Mar 10 2020, 6:21 AM.

Details

Summary

In the case of more than one SDep between two successor SUnits in the Nodeset, the current implementation sums the latencies of the dependencies, which could create a larger RecMII than necessary.

for example, in case there is both a data dependency and an output dependency (with latency > 0) between successor nodes:
SU(1) inst1:

successors:
  SU(2): out  latency = 1
  SU(2): data latency = 1

SU(2) inst2:

successors:
  SU(3): out  latency = 1
  SU(3): data latency = 1

SU(3) inst3:

successors:
  SU(1): out  latency = 1
  SU(1): data latency = 1

the NodeSet latency returned would be 6, whereas it could be 3 if we take the max for each successor SUnit.
In general this can be extended to finding the shortest path in the recurrence..
thoughts?

Unfortunately I had a hard time creating a test for this in Hexagon/PowerPC, so help would be appreciated.

Diff Detail

Event Timeline

lsaba created this revision.Mar 10 2020, 6:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 10 2020, 6:21 AM

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

re
Thanks for the fast reply!
we have an implicit physical register that certain instructions define, thus having several instructions in the code would create these output chains, the most similar to this that I found in Hexagon was the usr_ovf register, but I also saw that the Output Deps for these registers get removed in HexagonSubtarget::UsrOverflowMutation::apply.

(In powerPC there's a carry register, but the Output latency is 0)

I think stores to physical registers would also create such output chains?

This calculation could also happen with Data dependencies if there is more than one def,use chain between two instructions (for example multiple output instructions), are those present in Hexagon?

Hi @lsaba - this fix looks good to me. Thanks! It will be challenging to create a test case on Hexagon due to the output dependences. How to do get so many output dependences on your target?

re
Thanks for the fast reply!
we have an implicit physical register that certain instructions define, thus having several instructions in the code would create these output chains, the most similar to this that I found in Hexagon was the usr_ovf register, but I also saw that the Output Deps for these registers get removed in HexagonSubtarget::UsrOverflowMutation::apply.

(In powerPC there's a carry register, but the Output latency is 0)

I think stores to physical registers would also create such output chains?

This calculation could also happen with Data dependencies if there is more than one def,use chain between two instructions (for example multiple output instructions), are those present in Hexagon?

Yep, stores to physical registers will create output dependences, which can be done using inline assembly. At least, most of the code in the pipeliner that deals with output dependences is exercised using inline asm code.

The only Hexagon instructions with two outputs that the compiler generates are the post-increment load instructions, and it would be difficult to create a chain of these. There are some esoteric instructions, but they are typically generated through intrinsics only. I can try to create a test for this...

Yep, stores to physical registers will create output dependences, which can be done using inline assembly. At least, most of the code in the pipeliner that deals with output dependences is exercised using inline asm code.

The only Hexagon instructions with two outputs that the compiler generates are the post-increment load instructions, and it would be difficult to create a chain of these. There are some esoteric instructions, but they are typically generated through intrinsics only. I can try to create a test for this...

That would be very much appreciated!

Unfortunately, I've had no luck producing a test case that generates this type of graph.

lsaba added a comment.Apr 1 2020, 12:35 AM

Unfortunately, I've had no luck producing a test case that generates this type of graph.

Right, thanks for trying!! how do you suggest we proceed with this patch ?

bcahoon accepted this revision.Apr 3 2020, 11:43 AM

I've tested this patch with our version of LLVM and it passes our tests. I'm ok with adding the patch as-is, as long as no one else would object.

This revision is now accepted and ready to land.Apr 3 2020, 11:43 AM
lsaba added a comment.Apr 4 2020, 11:15 PM

I've tested this patch with our version of LLVM and it passes our tests. I'm ok with adding the patch as-is, as long as no one else would object.

great, thanks! I'm still having Github commit permission issues so I would appreciate if someone can submit it on behalf of me

jsji added a comment.Apr 13 2020, 11:35 AM

I will commit for you.

This revision was automatically updated to reflect the committed changes.

I will commit for you.

Thank you very much!