This is an archive of the discontinued LLVM Phabricator instance.

[MacroFusion] Create the missing artificial edges if there are more than 2 SU fused.
AbandonedPublic

Authored by steven.zhang on Nov 8 2019, 2:16 AM.

Details

Reviewers
jsji
nemanjai
hfinkel
fhahn
evandro
MatzeB
arsenm
Group Reviewers
Restricted Project
Summary

For now, llvm MacroFusion would fuse the adjacent instructions no matter if it has been fused before. However, we miss to create some edges that cause problem.

Assume that we have the code:

int foo(int a, int b, int c, int d) {
  return a + b + c +d;
}

And ADD and ADD are a fusion pair. And this is the Dependency graph.

+------+       +------+       +------+       +------+
|  A   |       |  B   |       |  C   |       |  D   |
+--+--++       +---+--+       +--+---+       +--+---+
   ^  ^            ^  ^          ^              ^
   |  |            |  |          |              |
   |  |            |  |New1      +--------------+
   |  |            |  |          |
   |  |            |  |       +--+---+
   |  |New2        |  +-------+ ADD1 |
   |  |            |          +--+---+
   |  |            |    Fuse     ^
   |  |            +-------------+
   |  +------------+
   |               |
   |   Fuse     +--+---+
   +----------->+ ADD2 |
   |            +------+
+--+---+
| ADD3 |
+------+

When ADD1 and ADD2 are fused, we will create an artificial edge New1 to make sure that, B is scheduled before ADD1. And when ADD3 and ADD2 are fused,
another artificial edge New2 is created to make sure that, A is scheduled before ADD2. However, this is NOT enough. We need to create another artificial edge from ADD1 to A to make sure that, A is scheduled before ADD1 also.

Diff Detail

Event Timeline

steven.zhang created this revision.Nov 8 2019, 2:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2019, 2:16 AM
jsji added a comment.Nov 8 2019, 8:23 PM

My understanding is that you are doing two fixes in this patch:

  1. Extend API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
  2. Extend fuseInstructionPair to add artificial data dependencies for chained-fusion

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
Also need additional tests for some target that support it (maybe RISC-V?)

llvm/include/llvm/CodeGen/MacroFusion.h
34

New parameters, should be documented in above comment as well.

llvm/test/CodeGen/AArch64/macro-fusion-verify.ll
4

This comment is confusing. I believe the goal of this patch is to avoid chained-fusion, hence reducing unnecessary dependency, so you would like to verify that there is no extra dependency?

25

Maybe we should check that we only fuse SU4 SU5, not SU5 SU6 too.

My understanding is that you are doing two fixes in this patch:

  1. Extend API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
  2. Extend fuseInstructionPair to add artificial data dependencies for chained-fusion

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

Yes, but still have problems.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
Also need additional tests for some target that support it (maybe RISC-V?)

The current implementation has already been implemented to support more than 2 SU's fusion, However, it misses to create some dependency edges. The patch is trying to fix the bug of the macro fusion infrastructure. From my understanding,
adding these missing edges is enough. I didn't see llvm support the macro fusion for RISC-V target. AMDGPU supports more than 2 SU's cluster for load. @arsenm Would you please help me confirm if AMDGPU target supports more than 2 SU's fusion ?

qiucf added a subscriber: qiucf.Nov 10 2019, 7:29 PM
This comment was removed by qiucf.

My understanding is that you are doing two fixes in this patch:

  1. Extend API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
  2. Extend fuseInstructionPair to add artificial data dependencies for chained-fusion

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

Yes, but still have problems.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
Also need additional tests for some target that support it (maybe RISC-V?)

The current implementation has already been implemented to support more than 2 SU's fusion, However, it misses to create some dependency edges. The patch is trying to fix the bug of the macro fusion infrastructure. From my understanding,
adding these missing edges is enough. I didn't see llvm support the macro fusion for RISC-V target. AMDGPU supports more than 2 SU's cluster for load. @arsenm Would you please help me confirm if AMDGPU target supports more than 2 SU's fusion ?

Load/store clustering should produce > 2 sections of clusters, but I don't remember the details of how the DAG mutation is implemented. Specifically for the MacroFusion mutation, I'm not sure. It may be useful to combine one def with multiple uses, but I'm not sure if that actually happens now.

My understanding is that you are doing two fixes in this patch:

  1. Extend API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
  2. Extend fuseInstructionPair to add artificial data dependencies for chained-fusion

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

Yes, but still have problems.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
Also need additional tests for some target that support it (maybe RISC-V?)

The current implementation has already been implemented to support more than 2 SU's fusion, However, it misses to create some dependency edges. The patch is trying to fix the bug of the macro fusion infrastructure. From my understanding,
adding these missing edges is enough. I didn't see llvm support the macro fusion for RISC-V target. AMDGPU supports more than 2 SU's cluster for load. @arsenm Would you please help me confirm if AMDGPU target supports more than 2 SU's fusion ?

Load/store clustering should produce > 2 sections of clusters, but I don't remember the details of how the DAG mutation is implemented. Specifically for the MacroFusion mutation, I'm not sure. It may be useful to combine one def with multiple uses, but I'm not sure if that actually happens now.

Yeah, from my investigation, the MacroFusion implementation should support it. Do you know the AMDGPU hw supports more than 2 SU's macro fusion as the Load/Store cluster or just as other target, that is back-to-back. I guess it is also back-to-back, but I want to confirm it.

My understanding is that you are doing two fixes in this patch:

  1. Extend API of shouldScheduleAdjacent to avoid fusing an instruction more than once along the same dependency chain.
  2. Extend fuseInstructionPair to add artificial data dependencies for chained-fusion

To me, yes, we should do #1 first, as https://reviews.llvm.org/D36704 was already trying to do so.

Yes, but still have problems.

#2 might not be necessary, as you mentioned, all existing targets only support back-to-back fusion.
If we want to extend it for chained-fusion (3 or more in same dependency chain),
then I believe we need more changes than adding data dependencies.
Also need additional tests for some target that support it (maybe RISC-V?)

The current implementation has already been implemented to support more than 2 SU's fusion, However, it misses to create some dependency edges. The patch is trying to fix the bug of the macro fusion infrastructure. From my understanding,
adding these missing edges is enough. I didn't see llvm support the macro fusion for RISC-V target. AMDGPU supports more than 2 SU's cluster for load. @arsenm Would you please help me confirm if AMDGPU target supports more than 2 SU's fusion ?

Load/store clustering should produce > 2 sections of clusters, but I don't remember the details of how the DAG mutation is implemented. Specifically for the MacroFusion mutation, I'm not sure. It may be useful to combine one def with multiple uses, but I'm not sure if that actually happens now.

Yeah, from my investigation, the MacroFusion implementation should support it. Do you know the AMDGPU hw supports more than 2 SU's macro fusion as the Load/Store cluster or just as other target, that is back-to-back. I guess it is also back-to-back, but I want to confirm it.

Load/Store does benefit from multiple instructions back to back.

The MacroFusion doesn't need back to back scheduling. We just want the use of the condition register to follow the def because it usually means we can use a smaller instruction encoding. It doesn't need to be the next instruction, it's just helpful to avoid another condition register def between the two instructions.

I get it. Thank you!

I will split this patch into two in response with Jinsong's comments.

  1. Fix the missing edges.
  2. Extend the interface to allow the target to specify the max fuse SU number.
steven.zhang edited the summary of this revision. (Show Details)

This patch is to fix the missing edges. I have updated the patch.

steven.zhang added a comment.EditedNov 10 2019, 11:59 PM

https://reviews.llvm.org/D70066 is created to limit the max number of the fusion instr.

At first glance, this patch seems sensible, but I'm not sure that D70066 is necessary.

Gentle ping...

fhahn added a comment.Dec 4 2019, 3:37 AM

With rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba landed, is this still relevant?

With rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba landed, is this still relevant?

Yes, they are different problems. rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba is trying to limit the number of chained SU's as two, while this patch is to fix the problem if we want to chain more than two SU's, though it is limited to two now. But by the design, we should have it work well if someone want to relax the limit later. It is somewhat like, we have the ability to chain any number of SU's, but now, it is limited to two, instead of, we can only chain two SU's, and have bugs if chain more.

There won't be any compiling time impact for this patch if it is limited to two, as the pred/succ of CurrentSU is always null if only chain two SU's.

fhahn added a comment.Dec 5 2019, 1:31 PM

With rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba landed, is this still relevant?

Yes, they are different problems. rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba is trying to limit the number of chained SU's as two, while this patch is to fix the problem if we want to chain more than two SU's, though it is limited to two now. But by the design, we should have it work well if someone want to relax the limit later. It is somewhat like, we have the ability to chain any number of SU's, but now, it is limited to two, instead of, we can only chain two SU's, and have bugs if chain more.

Sure, but currently there can be no bug and the interface prevents that case from happening. I don't see why we would need to deal with a case that might happen in the future, if the interface changes. To me it seems like the time to fix that would be when the interface gets extended. Until then, we cannot test the patch. To support fusing more than pairs, I think it would be better to do this in a separate function and deal with those cases there, rather than unnecessarily complicating the code for pairs.

There won't be any compiling time impact for this patch if it is limited to two, as the pred/succ of CurrentSU is always null if only chain two SU's.

Hm I do not think that's true, we still need to check all the predecessors/successors of the SUs, which potentially can be a large number for bad inputs. The compile-time impact of this patch on its own might be quite small, but at least in degenerate cases it could be measurable (same for rGd84b320dfd0a7dbedacc287ede5e5bc4c0f113ba )

steven.zhang abandoned this revision.Dec 5 2019, 6:56 PM

Hmm, we are putting a bomb here if someone want to get extends :P But I agree with you that as we cannot test this patch now, it is NOT the best time to fix it.

fhahn added a comment.Dec 6 2019, 2:32 AM

Hmm, we are putting a bomb here if someone want to get extends :P But I agree with you that as we cannot test this patch now, it is NOT the best time to fix it.

One possible way to deal with that would be to add an assertion that we chain at most 2 instructions together here, with a comment what the issue will be with chains longer than 2 instructions.

Good suggestion, thank you! I will post a patch to remove that bomb.

nhaehnle removed a subscriber: nhaehnle.Dec 9 2019, 1:26 AM