This patch adds a new flag named `-bottleneck-analysis` to print out information about throughput bottlenecks.
MCA already knows how to identify and classify dynamic dispatch stalls. However, it doesn't know how to analyze and highlight kernel bottlenecks.
The goal of this patch is to teach MCA how to correlate increases in backend pressure to backend stalls (and therefore, the loss of throughput).
Backend pressure increases because of contention on processor resources.
From a Scheduler point of view, backend pressure is a function of the scheduler buffer usage (i.e. how the number of uOps in the scheduler buffers changes over time).
Backend pressure increases (or decreases) when there is a mismatch between the number of opcodes dispatched to the Scheduler and the number of opcodes issued to the underlying pipelines during the same cycle.
Since buffer resources are limited, continuous increases in backend pressure would eventually lead to dispatch stalls. So, there is a strong correlation between dispatch stalls and how backend pressure changed over time.
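To make that bookkeeping concrete, here is a minimal C++ sketch (not the actual MCA implementation; `PressureTracker`, `NumDispatched` and `NumIssued` are hypothetical names) of how backend pressure can be tracked from per-cycle dispatch/issue counts:

```cpp
#include <cstdint>

// Tracks the net number of uOps sitting in the scheduler buffers.
struct PressureTracker {
  int64_t BufferedUOps = 0;

  // Returns true if backend pressure increased during this cycle, i.e.
  // more opcodes were dispatched to the Scheduler than were issued to
  // the underlying pipelines.
  bool onCycleEnd(unsigned NumDispatched, unsigned NumIssued) {
    BufferedUOps += int64_t(NumDispatched) - int64_t(NumIssued);
    return NumDispatched > NumIssued;
  }
};
```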
This patch teaches MCA how to identify situations where backend pressure increases due to:
- unavailable pipeline resources.
- data dependencies.
Data dependencies may delay the execution of instructions and therefore increase the time that uOps have to spend in the scheduler buffers. That often translates to an increase in backend pressure which may eventually lead to a bottleneck.
Contention on pipeline resources may also delay the execution of instructions and lead to a temporary increase in backend pressure.
Internally, the Scheduler classifies instructions based on whether register/memory operands are available or not. An instruction is marked as "ready to execute" only if its data dependencies are fully resolved.
Every cycle, the Scheduler attempts to issue all instructions that are ready to execute. If an instruction cannot execute because of unavailable pipeline resources, then the Scheduler internally updates a `BusyResourceUnits` mask with the ID of each unavailable resource.
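As a rough illustration of that logic, here is a simplified sketch under assumed names (`ResourceIDMask`, `InstrDesc`, and the issue loop below are not the actual Scheduler code):

```cpp
#include <cstdint>
#include <vector>

using ResourceIDMask = uint64_t; // one bit per pipeline resource.

struct InstrDesc {
  ResourceIDMask RequiredResources;
  bool IsReady; // data dependencies fully resolved.
};

struct SchedulerSketch {
  ResourceIDMask AvailableResources = ~0ULL;
  ResourceIDMask BusyResourceUnits = 0; // rebuilt every cycle.

  void cycleIssue(std::vector<InstrDesc> &Instructions) {
    BusyResourceUnits = 0;
    for (InstrDesc &ID : Instructions) {
      if (!ID.IsReady)
        continue; // still waiting on operands.
      ResourceIDMask Unavailable =
          ID.RequiredResources & ~AvailableResources;
      if (Unavailable) {
        // Delayed by resource pressure: remember which units were busy.
        BusyResourceUnits |= Unavailable;
        continue;
      }
      AvailableResources &= ~ID.RequiredResources; // issue and consume.
    }
  }
};
```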
`ExecuteStage` is responsible for tracking changes in backend pressure. If backend pressure increases during a cycle because of contention on pipeline resources, then `ExecuteStage` sends a "backend pressure" event to the listeners.
That event contains information about the instructions delayed by resource pressure, as well as the `BusyResourceUnits` mask.
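For illustration only, such an event could be modeled along these lines (the struct and field names are assumptions, not the exact llvm-mca definitions):

```cpp
#include <cstdint>
#include <vector>

struct BackendPressureEventSketch {
  // IDs of the instructions whose execution was delayed by resource
  // pressure during the cycle.
  std::vector<unsigned> DelayedInstructions;
  // One bit set for each pipeline resource that was unavailable.
  uint64_t BusyResourceUnits = 0;
};
```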
Note that `ExecuteStage` also knows how to identify situations where backend pressure increased because of delays introduced by data dependencies.
The `SummaryView` observes "backend pressure" events and prints out a "bottleneck report".
Example of bottleneck report:

```
Cycles with backend pressure increase [ 99.89% ]
  Resource Pressure        [ 0.00% ]
  Data Dependencies:       [ 99.89% ]
  - Register Dependencies  [ 0.00% ]
  - Memory Dependencies    [ 99.89% ]
```
A bottleneck report is printed out only if increases in backend pressure eventually caused backend stalls.
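To show how such a report can be derived from the collected events, here is a hedged sketch (the counter names are assumptions, not the actual `SummaryView` fields); each percentage is simply the fraction of simulated cycles in which the corresponding kind of pressure increase was observed:

```cpp
#include <cstdio>

struct BottleneckCountersSketch {
  unsigned TotalCycles = 0;
  unsigned PressureIncreaseCycles = 0;
  unsigned ResourcePressureCycles = 0;
  unsigned DataDepCycles = 0;
  unsigned RegisterDepCycles = 0;
  unsigned MemoryDepCycles = 0;

  static double percent(unsigned Part, unsigned Whole) {
    return Whole ? (100.0 * Part) / Whole : 0.0;
  }

  void printReport() const {
    std::printf("Cycles with backend pressure increase [ %.2f%% ]\n",
                percent(PressureIncreaseCycles, TotalCycles));
    std::printf("  Resource Pressure        [ %.2f%% ]\n",
                percent(ResourcePressureCycles, TotalCycles));
    std::printf("  Data Dependencies:       [ %.2f%% ]\n",
                percent(DataDepCycles, TotalCycles));
    std::printf("  - Register Dependencies  [ %.2f%% ]\n",
                percent(RegisterDepCycles, TotalCycles));
    std::printf("  - Memory Dependencies    [ %.2f%% ]\n",
                percent(MemoryDepCycles, TotalCycles));
  }
};
```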
About the time complexity:
Time complexity is linear in the number of instructions in the `Scheduler::PendingSet`.
The average slowdown tends to be in the range of ~5-6%.
For memory intensive kernels, the slowdown can be significant if flag `-noalias=false` is specified. In the worst case scenario, I have observed a slowdown of ~30%.
We can definitely recover part of that slowdown if we optimize class LSUnit (by doing extra bookkeeping to speed up queries).
This new analysis is disabled by default, and it can be enabled via flag `-bottleneck-analysis`.
Users of MCA as a library can enable the generation of pressure events through the constructor of `ExecuteStage`. For simplicity, users of the default pipeline can simply specify a new pipeline option.
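As a hedged example of the library route (the exact parameter name below is an assumption based on this description, not a verbatim API reference), a custom pipeline could construct the stage like so:

```cpp
#include <memory>

#include "llvm/MCA/HardwareUnits/Scheduler.h"
#include "llvm/MCA/Stages/ExecuteStage.h"

// Creates an ExecuteStage that also generates "backend pressure" events.
// The boolean parameter name is illustrative.
std::unique_ptr<llvm::mca::ExecuteStage>
createExecuteStage(llvm::mca::Scheduler &S) {
  return std::make_unique<llvm::mca::ExecuteStage>(
      S, /*EnablePressureEvents=*/true);
}
```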
This patch partially addresses https://bugs.llvm.org/show_bug.cgi?id=37494
A follow-up patch will extend the "scheduler-stats" view to also print out:
- the most problematic register dependencies (top 3)
- the most problematic memory dependencies (top 3)
- instructions most affected by bottlenecks caused by pipeline pressure (top 3).
That change plus this patch should fully address PR37494.
Let me know if it is okay to commit.