Index: llvm/trunk/docs/CommandGuide/llvm-mca.rst
===================================================================
--- llvm/trunk/docs/CommandGuide/llvm-mca.rst
+++ llvm/trunk/docs/CommandGuide/llvm-mca.rst
@@ -305,9 +305,9 @@
 cycles to the machine instruction in the sequence. For example, every iteration
 of the instruction vmulps always executes on resource unit [6] (JFPU1 - floating
 point pipeline #1), consuming an average of 1 resource cycle
-per iteration. Note that on Jaguar, vector floating-point multiply can only be
-issued to pipeline JFPU1, while horizontal floating-point additions can only be
-issued to pipeline JFPU0.
+per iteration. Note that on AMD Jaguar, vector floating-point multiply can
+only be issued to pipeline JFPU1, while horizontal floating-point additions can
+only be issued to pipeline JFPU0.
 
 The resource pressure view helps with identifying bottlenecks caused by high
 usage of specific hardware resources. Situations with resource pressure mainly
@@ -427,3 +427,125 @@
 resources, the delta between the two counters is small. However, the number of
 cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
 especially when compared to other low latency instructions.
+
+Extra Statistics to Further Diagnose Performance Issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The ``-all-stats`` command line option enables extra statistics and performance
+counters for the dispatch logic, the reorder buffer, the retire control unit,
+and the register file.
+
+Below is an example of ``-all-stats`` output generated by MCA for the
+dot-product example discussed in the previous sections.
+
+.. code-block:: none
+
+  Dynamic Dispatch Stall Cycles:
+  RAT     - Register unavailable:                      0
+  RCU     - Retire tokens unavailable:                 0
+  SCHEDQ  - Scheduler full:                            272
+  LQ      - Load queue full:                           0
+  SQ      - Store queue full:                          0
+  GROUP   - Static restrictions on the dispatch group: 0
+
+
+  Dispatch Logic - number of cycles where we saw N instructions dispatched:
+  [# dispatched], [# cycles]
+   0,              24  (3.9%)
+   1,              272  (44.6%)
+   2,              314  (51.5%)
+
+
+  Schedulers - number of cycles where we saw N instructions issued:
+  [# issued], [# cycles]
+   0,          7  (1.1%)
+   1,          306  (50.2%)
+   2,          297  (48.7%)
+
+
+  Scheduler's queue usage:
+  JALU01,  0/20
+  JFPU01,  18/18
+  JLSAGU,  0/12
+
+
+  Retire Control Unit - number of cycles where we saw N instructions retired:
+  [# retired], [# cycles]
+   0,           109  (17.9%)
+   1,           102  (16.7%)
+   2,           399  (65.4%)
+
+
+  Register File statistics:
+  Total number of mappings created:    900
+  Max number of mappings used:         35
+
+  *  Register File #1 -- JFpuPRF:
+     Number of physical registers:     72
+     Total number of mappings created: 900
+     Max number of mappings used:      35
+
+  *  Register File #2 -- JIntegerPRF:
+     Number of physical registers:     64
+     Total number of mappings created: 0
+     Max number of mappings used:      0
+
+If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
+SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
+logic is unable to dispatch a group of two instructions because the scheduler's
+queue is full.
+
+Looking at the *Dispatch Logic* table, we see that the pipeline was only able
+to dispatch two instructions 51.5% of the time. The dispatch group was limited
+to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
+dispatch statistics are displayed by using either the command option
+``-all-stats`` or ``-dispatch-stats``.
+
+The next table, *Schedulers*, presents a histogram of the number of cycles in
+which a given number of instructions was issued. In
+this case, of the 610 simulated cycles, a single instruction was issued in
+306 cycles (50.2%), and there were 7 cycles in which no instructions were
+issued.
+
+The *Scheduler's queue usage* table shows the maximum number of buffer
+entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
+reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
+three schedulers:
+
+* JALU01 - A scheduler for ALU instructions.
+* JFPU01 - A scheduler for floating point operations.
+* JLSAGU - A scheduler for address generation.
+
+The dot-product is a kernel of three floating point instructions (a vector
+multiply followed by two horizontal adds). That explains why only the floating
+point scheduler appears to be used.
+
+A full scheduler queue is caused either by data dependency chains or by
+sub-optimal usage of hardware resources. Sometimes, resource pressure can be
+mitigated by rewriting the kernel using different instructions that consume
+different scheduler resources. Schedulers with a small queue are less resilient
+to bottlenecks caused by the presence of long data dependencies.
+The scheduler statistics are displayed by
+using the command option ``-all-stats`` or ``-scheduler-stats``.
+
+The next table, *Retire Control Unit*, presents a histogram of the number of
+cycles in which a given number of instructions was retired. In
+this case, of the 610 simulated cycles, two instructions were retired during
+the same cycle 399 times (65.4%) and there were 109 cycles where no
+instructions were retired. The retire statistics are displayed by using the
+command option ``-all-stats`` or ``-retire-stats``.
+
+The last table presented is *Register File statistics*. Each physical register
+file (PRF) used by the pipeline is presented in this table. In the case of AMD
+Jaguar, there are two register files, one for floating-point registers
+(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that
+900 mappings were created for the 900 instructions processed. Since this
+dot-product example utilized only floating point registers, the JFpuPRF was
+responsible for creating the 900 mappings. However, we see that the pipeline
+only used a maximum of 35 of 72 available register slots at any given time. We
+can conclude that the floating point PRF was the only register file used for
+the example, and that it was never resource constrained. The register file
+statistics are displayed by using the command option ``-all-stats`` or
+``-register-file-stats``.
+
+In this example, we can conclude that the IPC is mostly limited by data
+dependencies, and not by resource pressure.