This is an archive of the discontinued LLVM Phabricator instance.

[llvm-mca][BtVer2] Teach how to identify dependency-breaking idioms.
ClosedPublic

Authored by andreadb on Jul 13 2018, 11:19 AM.

Details

Summary

This patch teaches llvm-mca how to identify dependency breaking instructions on btver2.

An example of dependency breaking instruction is the zero-idiom XOR (example: XOR %eax, %eax), which always generates zero regardless of the actual input register value.
Dependency breaking instructions don't have to wait on their input register operands before executing. This is because the result of the execution is known in advance (i.e. it is not dependent on the actual value of the input register).

Not all dependency breaking idioms are also zero-latency instructions. For example, CMPEQ %xmm1, %xmm1 sets %xmm1 to all-ones, and it is executed by a pipeline.

This patch adds a new method named isDependencyBreaking() to the MCInstrAnalysis interface. That method takes as input an instruction (i.e. MCInst) and a MCSubtargetInfo.
The default implementation of isDependencyBreaking() conservatively returns false for all instructions. Targets may override the default behavior, and return a value which better matches the subtarget processor behavior.

BtVer2 tests that have been affected by this change now return the expected number of instructions per cycle.

Please let me know if okay to commit.

-Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision.Jul 13 2018, 11:19 AM
mattd added a comment.Jul 13 2018, 3:35 PM

This change LGTM, but you should let a few others take a peek before landing this.

include/llvm/MC/MCInstrAnalysis.h
90 ↗(On Diff #155435)

Suggestion: It might be useful to reference Agner Fog's microarchitecture.pdf 17.9 "Breaking dependence," similar to how you reference another part of that doc in D49196. Although, I think your description is clearer than 17.9, but just a thought. A reader might find other parts of 17.x useful, such as 17.8.

andreadb added inline comments.Jul 14 2018, 5:42 AM
include/llvm/MC/MCInstrAnalysis.h
90 ↗(On Diff #155435)

Good point. I will add a reference to the Agner Fog's "microarchitecture" to the definition of isDependencyBreaking() in X86MCTargetDesc.cpp. I prefer to keep this comment simple if possible.

Can you add a comment to the end of the btver sched profile, that X86MCInstrAnalysis::isDependencyBreaking()
is also responsible (as in, needs to be modified) in detection of dep-breaking patterns?

include/llvm/MC/MCInstrAnalysis.h
93 ↗(On Diff #155435)

s/know/known/?

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
317 ↗(On Diff #155435)

Hmm. Hardcoding.
Are there plans on somehow generalizing this, maybe using sched profiles?

Can you add a comment to the end of the btver sched profile, that X86MCInstrAnalysis::isDependencyBreaking()
is also responsible (as in, needs to be modified) in detection of dep-breaking patterns?

If the goal is to document this behavior, a comment in the class MCInstrAnalysis should be enough.
If the goal is to provide guidelines for people that want to use llvm-mca, then we should have all this features documented in a .rst file.
In the next future (realistically by the end of august), with the help of Matt, there is a plan improve the llvm-mca documentation. We would describe all the changes that would help improving the analysis in llvm-mca.

include/llvm/MC/MCInstrAnalysis.h
93 ↗(On Diff #155435)

Thanks. I will fix it.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
317 ↗(On Diff #155435)

There is no plan. I have experimented with alternative approaches.
The 'isDependencyBreaking' could be generated by tablegen for scheduling models. However, it would require a non trivial change in the tablegen backends. In all honestly, I don't think it is worthy to do it just for the sake of better supporting dep-breaking idioms. In my opinion, this patch is simple and less invasive than any other solution involving tablegen.
If we use a tablegen approach, then the only advantage is that we generate nicer checks against CPUID, instead of forcing string compares (CPU names). The rest of the auto-generated code would be pretty much identical (in my experience, a bit less readable). However, we would end up paying a non negligible design cost for a feature that is not critical imho.
So, I don't think that there are better approaches (at least for now).

Can you add a comment to the end of the btver sched profile, that X86MCInstrAnalysis::isDependencyBreaking()
is also responsible (as in, needs to be modified) in detection of dep-breaking patterns?

If the goal is to document this behavior, a comment in the class MCInstrAnalysis should be enough.
If the goal is to provide guidelines for people that want to use llvm-mca, then we should have all this features documented in a .rst file.

The goal is to spare time of whoever will be adding some new scheduling model by explaining that the X86MCInstrAnalysis::isDependencyBreaking() needs to be modified too, for mca to be able to properly handle these dependency-breaking idioms.

In the next future (realistically by the end of august), with the help of Matt, there is a plan improve the llvm-mca documentation. We would describe all the changes that would help improving the analysis in llvm-mca.

andreadb updated this revision to Diff 155646.Jul 16 2018, 5:05 AM

Patch updated.

Added a comment to X86MCInstrAnalysis::isDependencyBreaking.

@craig.topper do you think this approach is acceptable?

RKSimon added inline comments.Jul 17 2018, 3:38 AM
include/llvm/MC/MCInstrAnalysis.h
97 ↗(On Diff #155646)

I don't know if this description is too x86 specific or not - AFAICT the actual 'same register' test is inside X86MCInstrAnalysis::isDependencyBreaking so it isn't a generic requirement?

An alternative approach could be for MCInstrAnalysis::isDependencyBreaking to return true if the instruction is independent of some/all of its input operands and a APInt& Mask (or similar) is used to list the operands that it doesn't depend upon - not sure if we need all that functionality or not though so it might be over-engineering it.

Also, do you know if we can make use of this in the machine schedulers or whatever? I'm not against it being inside MCInstrAnalysis in general although the fact that its so cpu-target specific suggests it could get bulky. @craig.topper What do you think?

andreadb added inline comments.Jul 17 2018, 4:04 AM
include/llvm/MC/MCInstrAnalysis.h
97 ↗(On Diff #155646)

I am okay with either approach. We can use an APInt to identify register reads that can be bypassed.

If we want to use this knowledge in the machine schedulers, then we can add a hook to the InstructionInfo interface. Each target can then describe dependency breaking instructions using a similar logic.
However, that new hook would take a MachineInstr in input, and not a MCInst.
So, we still have to expose that information to MC.

We could use the new scheduling predicates to construct a TII predicate. That predicate is translated by tablegen into a method to the TII interface. We can then expand that same predicate into a method in MCInstrAnalysis. That should be doable. I wanted to avoid this scenario. However, if the plan is to reuse this knowledge elsewhere, then I can revisit this patch.

lebedev.ri added inline comments.Jul 20 2018, 7:13 AM
test/tools/llvm-mca/X86/BtVer2/one-idioms.s
28 ↗(On Diff #155646)

You probably wanted to drop this comment.

tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

So it will no longer even consider the sched profile?

andreadb added inline comments.Jul 20 2018, 7:43 AM
test/tools/llvm-mca/X86/BtVer2/one-idioms.s
28 ↗(On Diff #155646)

I will remove it.

tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

Not sure I understand the question. I am definitely using profile information from the scheduling model.

lebedev.ri added inline comments.Jul 20 2018, 7:45 AM
tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

What i'm asking is - what happens if sched model says that instruction N has zero latency, but isDependencyBreaking() does not say that?

andreadb added inline comments.Jul 20 2018, 8:23 AM
tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

In that case, instruction N will have to wait in the scheduler until input registers are all available. Then it is executed.

andreadb added inline comments.Jul 20 2018, 8:40 AM
tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

If it doesn’t work like that, then there is a bug. I cannot test it at the moment as I am not at work. I will check it on next days. Thanks.

lebedev.ri added inline comments.Jul 20 2018, 8:41 AM
tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

So the scheduler-profile-aware instruction scheduler will schedule it as-if it has zero latency,
but the mca-based analysis will not?
And how would one go about detecting such inconsistencies?
This all makes me uneasy.

andreadb added inline comments.Jul 20 2018, 8:53 AM
tools/llvm-mca/DispatchStage.cpp
110 ↗(On Diff #155646)

No. I never said that.
It would still execute zero cycles. It simply has to wait for the operands.

Since I didn’t specifically test that bogus scenario, I will check that it really behaves like that. I cannot do that test now because I am not at work. I will be back in a few days.

andreadb updated this revision to Diff 157958.Jul 30 2018, 7:26 AM

Patch rebased and updated.

To answe to Roman's questions:
I verified that the patch works as intended for zero latency instructions that are not marked as "dependency breaking".
If a zero-latency instruction reads a register, then it has to wait on the input. If it is marked as dep-breaking, then it can issue immediately.
If a zero-latency is not marked dep-breaking, and the inputs are all ready, then it can be issued immediately.

As I wrote in a previous comment, we can have a follow-up patch where we teach to tablegen how to generate the body of isDependencyBreaking predicate for us (and potentially expose that same logic to TargetInstrInfo).

Couple of final minor comments

include/llvm/MC/MCInstrAnalysis.h
97 ↗(On Diff #155646)

OK - we don't have a use for this yet but please can you add a TODO suggesting that per-operand dependency breaking might be useful in the future.

Also, please update the comment as "if input register operands are all the same register." isn't mandatory for an instruction to be dependency breaking (move that part of the comment into the X86 XOR example) - for instance XOP VPCOM instructions ignores the input registers if the compare mode (immediate value) is TRUE or FALSE.

andreadb updated this revision to Diff 158203.Jul 31 2018, 3:43 AM

Patch updated.

Addressed review comments.

RKSimon accepted this revision.Jul 31 2018, 4:08 AM

LGTM - thanks

This revision is now accepted and ready to land.Jul 31 2018, 4:08 AM
This revision was automatically updated to reflect the committed changes.