This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] print max throughput for 0-latency insts
ClosedPublic

Authored by spatel on Jun 4 2018, 8:57 AM.

Details

Summary

This is a fix for the problem arising in D47374 (PR37678):
https://bugs.llvm.org/show_bug.cgi?id=37678

We don't have throughput info for instructions with variant scheduling, so assume that a zero-latency instruction can execute/complete at the maximum issue width.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Jun 4 2018, 8:57 AM

andreadb added a comment.

Hey Sanjay,

what if instead we change method MCSchedModel::getReciprocalThroughput?
If we decide to always return a meaningful value for instructions that don't consume processor resources, then I think that we can get rid of Optional<>.
If an instruction is associated with an invalid scheduling class (i.e. !SCDesc->isValid()), then we should just return 1/IssueWidth.
If an instruction doesn't consume resources, then we should return N/IssueWidth (where N is the NumMicroOpcodes).
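
A minimal standalone sketch of those two fallbacks (the type and function names below are simplified stand-ins for illustration, not the actual MCSchedModel/MCSchedClassDesc API):

    // Simplified stand-in for the relevant MCSchedClassDesc fields.
    struct SchedClassInfo {
      bool Valid;           // corresponds to SCDesc->isValid()
      unsigned NumMicroOps; // "N" (NumMicroOpcodes) in the text above
    };

    double fallbackRThroughput(const SchedClassInfo &SC, unsigned IssueWidth) {
      if (!SC.Valid)
        return 1.0 / IssueWidth;                   // invalid class: 1/IssueWidth
      return (double)SC.NumMicroOps / IssueWidth;  // no resources: N/IssueWidth
    }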

I also prefer that we don't touch getSchedInfoStr(). My opinion is that adding extra logic to that method is going to make it more difficult to get rid of it later. Strictly speaking, the idea of adding that method to the subtarget was bad; there is no reason why the "profile string" should be returned by the subtarget! My wish is that we get rid of getSchedInfoStr() (as well as function createSchedInfoStr) in the near future.

Not sure if this makes sense.

-Andrea

spatel added a comment.Jun 4 2018, 9:33 AM

Yes, I think that all makes sense. I actually looked at the higher-level change first to remove the Optional<>, but I wasn't sure if we wanted to do that or just go with the quick hack for now. I'll have a look at fixing this better.

spatel updated this revision to Diff 149845.Jun 4 2018, 1:45 PM
spatel edited the summary of this revision. (Show Details)

Patch updated:
Please have a close look at the diffs because I'm not sure how to test all of these paths.

The cases that I see behaving as intended via tests are the recent variant scheduled "xorps %xmm0, %xmm0" and existing non-zero latency cases like:
; ZNVER1-SSE-NEXT: ldmxcsr -{{[0-9]+}}(%rsp) # sched: [100:?]
(those cases are all unchanged, as I think is expected for this patch...but let me know if we want to treat that differently)

The big logical difference in this patch is that instead of using an Optional, we rely on FP NaN to represent the unknown state. Then, we modify the throughput calcs to account for the cases where no execution resources are specified by using default/max issue widths.

andreadb added a comment.

> Patch updated:
> Please have a close look at the diffs because I'm not sure how to test all of these paths.

> The cases that I see behaving as intended via tests are the recent variant scheduled "xorps %xmm0, %xmm0" and existing non-zero latency cases like:
> ; ZNVER1-SSE-NEXT: ldmxcsr -{{[0-9]+}}(%rsp) # sched: [100:?]
> (those cases are all unchanged, as I think is expected for this patch...but let me know if we want to treat that differently)

I think that IssueWidth/NumMicroOps should always be the upper bound for the throughput. So, I am of the opinion that we should not restrict this change to zero latency instructions.

> The big logical difference in this patch is that instead of using an Optional, we rely on FP NaN to represent the unknown state. Then, we modify the throughput calcs to account for the cases where no execution resources are specified by using default/max issue widths.

I think it is easier to use 0.0 as the default "unknown" value.
The reciprocal throughput can never be 0.0; an RThroughput of zero would effectively mean that the throughput tends to infinity. This is an impossible scenario since we don't have infinite parallelism in hardware (i.e. we cannot model infinite resources), and we cannot execute an infinite number of instructions per cycle. Essentially, 0.0 is our asymptote.
So, it is safe to use 0.0 as the special "unknown" throughput. For instructions that don't consume resource cycles we would still return N/IssueWidth.
Also, you don't need to use the "unknown" value in any max computation (see comments below).
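
For illustration, a caller printing the sched comment could then treat 0.0 as "no information" (a hypothetical helper that mimics the "[latency:rthroughput]" format seen in the tests above, not the actual createSchedInfoStr code):

    #include <cstdio>
    #include <string>

    // 0.0 is the "unknown" sentinel: print '?' instead of a number.
    std::string makeSchedComment(unsigned Latency, double RThroughput) {
      char Buf[32];
      if (RThroughput != 0.0)
        std::snprintf(Buf, sizeof(Buf), "[%u:%.2f]", Latency, RThroughput);
      else
        std::snprintf(Buf, sizeof(Buf), "[%u:?]", Latency);
      return std::string(Buf);
    }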

lib/CodeGen/TargetSchedule.cpp
336 ↗(On Diff #149845)

I would just return 0.0 instead of NaN.

351 ↗(On Diff #149845)

Same.

lib/CodeGen/TargetSubtargetInfo.cpp
74 ↗(On Diff #149845)

If we use 0.0 as the unknown value, then this would become:

if (RThroughput != 0.0)

lib/MC/MCSchedule.cpp
91–111 ↗(On Diff #149845)

My opinion is that we should not limit this to zero-latency instructions.

Here, you could have just reused the original code which computes the Throughput as an Optional<double>.

If Throughput is not defined, then simply return SCDesc.NumMicroOps / SM.IssueWidth.
Basically, you should be able to only change the return statement.
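
Roughly, that keeps the existing per-resource computation intact and only changes the final return. A standalone approximation using std::optional (the real code iterates over MCWriteProcResEntry records; the names here are illustrative):

    #include <algorithm>
    #include <optional>
    #include <vector>

    struct ResourceUse { unsigned NumUnits; unsigned Cycles; };

    double reciprocalThroughput(const std::vector<ResourceUse> &Uses,
                                unsigned NumMicroOps, unsigned IssueWidth) {
      std::optional<double> Throughput;
      for (const ResourceUse &U : Uses) {
        if (!U.Cycles)
          continue;
        double Temp = (double)U.NumUnits / U.Cycles; // units available per cycle
        Throughput = Throughput ? std::min(*Throughput, Temp) : Temp;
      }
      if (Throughput)
        return 1.0 / *Throughput;
      // Changed return: no resource info means the instruction is only
      // limited by how many micro-ops can issue per cycle.
      return (double)NumMicroOps / IssueWidth;
    }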

149 ↗(On Diff #149845)

You can simply return 1/DefaultIssueWidth.
This computation may not be accurate; ideally, we should use the actual IssueWidth from the scheduling model. In practice, this code is not used: x86 models don't use itineraries, and - as far as I know - x86 is the only target that enables the -print-schedule functionality.
Fixing this would probably require a change to this API, which is probably outside of the scope of this patch. For now, I think that returning a default 1/DefaultIssueWidth is good enough.
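
A minimal sketch of that fallback, assuming DefaultIssueWidth refers to the generic default of one instruction per cycle (an assumption, not verified against the model headers here):

    // Itinerary-based path: no per-CPU IssueWidth is available here,
    // so fall back to an assumed default of one instruction per cycle.
    double itineraryRThroughputFallback() {
      const unsigned DefaultIssueWidth = 1; // assumed default value
      return 1.0 / DefaultIssueWidth;
    }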

spatel updated this revision to Diff 150031.Jun 5 2018, 1:21 PM
spatel marked 5 inline comments as done.

Patch updated:
Calculate throughput for non-zero-latency instructions with no throughput specified.
This is a smaller code change than the last revision, but as shown, it results in a large number of test diffs.
I was trying to avoid this in previous versions of the patch, but given that jumping directly to this step keeps the code change smaller, I think it makes sense.
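
As a worked example of the fallback (the IssueWidth value here is assumed for illustration, not taken from the znver1 model):

    #include <cstdio>

    int main() {
      unsigned NumMicroOps = 1, IssueWidth = 4; // hypothetical values
      double RThroughput = (double)NumMicroOps / IssueWidth;
      // A microcoded instruction previously printed as "sched: [100:?]"
      // now gets a computed reciprocal throughput:
      std::printf("sched: [100:%.2f]\n", RThroughput); // prints sched: [100:0.25]
    }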

andreadb accepted this revision.Jun 5 2018, 2:22 PM

Thanks Sanjay.
It looks good to me.
I agree that this affects a lot of tests. However, it looks like most of the diffs are for znver1, which apparently assigns an arbitrary 100cy latency, 1 uOp, and no processor resource cycles to all microcoded instructions. So I am not particularly worried about it.
In the absence of extra/accurate schedule information, I think that computing an optimistic reciprocal throughput based on the number of micro-opcodes and the IssueWidth is the best that we can do.

This revision is now accepted and ready to land.Jun 5 2018, 2:22 PM
This revision was automatically updated to reflect the committed changes.