Index: llvm/trunk/docs/CommandGuide/llvm-mca.rst =================================================================== --- llvm/trunk/docs/CommandGuide/llvm-mca.rst +++ llvm/trunk/docs/CommandGuide/llvm-mca.rst @@ -21,9 +21,9 @@ when run on the target, but also help with diagnosing potential performance issues. -Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per -Cycle), as well as hardware resource pressure. The analysis and reporting style -were inspired by the IACA tool from Intel. +Given an assembly code sequence, llvm-mca estimates the IPC, as well as +hardware resource pressure. The analysis and reporting style were inspired by +the IACA tool from Intel. :program:`llvm-mca` allows the usage of special code comments to mark regions of the assembly code to be analyzed. A comment starting with substring @@ -207,3 +207,223 @@ :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed to standard error, and the tool returns 1. +HOW MCA WORKS +------------- + +MCA takes assembly code as input. The assembly code is parsed into a sequence +of MCInst with the help of the existing LLVM target assembly parsers. The +parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate +a performance report. + +The Pipeline module simulates the execution of the machine code sequence in a +loop of iterations (default is 100). During this process, the pipeline collects +a number of execution related statistics. At the end of this process, the +pipeline generates and prints a report from the collected statistics. + +Here is an example of a performance report generated by MCA for a dot-product +of two packed float vectors of four elements. The analysis is conducted for +target x86, cpu btver2. The following result can be produced via the following +command using the example located at +``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: + +.. 
code-block:: bash + + $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s + +.. code-block:: none + + Iterations: 300 + Instructions: 900 + Total Cycles: 610 + Dispatch Width: 2 + IPC: 1.48 + Block RThroughput: 2.0 + + + Instruction Info: + [1]: #uOps + [2]: Latency + [3]: RThroughput + [4]: MayLoad + [5]: MayStore + [6]: HasSideEffects (U) + + [1] [2] [3] [4] [5] [6] Instructions: + 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2 + 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3 + 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4 + + + Resources: + [0] - JALU0 + [1] - JALU1 + [2] - JDiv + [3] - JFPA + [4] - JFPM + [5] - JFPU0 + [6] - JFPU1 + [7] - JLAGU + [8] - JMul + [9] - JSAGU + [10] - JSTC + [11] - JVALU0 + [12] - JVALU1 + [13] - JVIMUL + + + Resource pressure per iteration: + [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] + - - - 2.00 1.00 2.00 1.00 - - - - - - - + + Resource pressure by instruction: + [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions: + - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2 + - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3 + - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4 + +According to this report, the dot-product kernel has been executed 300 times, +for a total of 900 dynamically executed instructions. + +The report is structured in three main sections. The first section collects a +few performance numbers; the goal of this section is to give a very quick +overview of the performance throughput. In this example, the two important +performance indicators are the predicted total number of cycles, and the +Instructions Per Cycle (IPC). IPC is probably the most important throughput +indicator. A big delta between the Dispatch Width and the computed IPC is an +indicator of potential performance issues. + +The second section of the report shows the latency and reciprocal +throughput of every instruction in the sequence. 
That section also reports +extra information related to the number of micro opcodes, and opcode properties +(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). + +The third section is the *Resource pressure view*. This view reports +the average number of resource cycles consumed every iteration by instructions +for every processor resource unit available on the target. Information is +structured in two tables. The first table reports the number of resource cycles +spent on average every iteration. The second table correlates the resource +cycles to the machine instruction in the sequence. For example, every iteration +of the instruction vmulps always executes on resource unit [6] +(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle +per iteration. Note that on Jaguar, vector floating-point multiply can only be +issued to pipeline JFPU1, while horizontal floating-point additions can only be +issued to pipeline JFPU0. + +The resource pressure view helps with identifying bottlenecks caused by high +usage of specific hardware resources. Situations with resource pressure mainly +concentrated on a few resources should, in general, be avoided. Ideally, +pressure should be uniformly distributed between multiple resources. + +Timeline View +^^^^^^^^^^^^^ +MCA's timeline view produces a detailed report of each instruction's state +transitions through an instruction pipeline. This view is enabled by the +command line option ``-timeline``. As instructions transition through the +various stages of the pipeline, their states are depicted in the view report. +These states are represented by the following characters: + +* D : Instruction dispatched. +* e : Instruction executing. +* E : Instruction executed. +* R : Instruction retired. +* = : Instruction already dispatched, waiting to be executed. +* \- : Instruction executed, waiting to be retired. 
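
As a rough illustration of how these state characters combine into a row of the
timeline view, here is a small Python sketch. This is a hypothetical helper for
illustration only, not code from the llvm-mca sources; it assumes the
simplified model that an instruction dispatches, issues, writes back, and
retires at single known cycles.

.. code-block:: python

   # Hypothetical helper, not part of llvm-mca: render a single timeline row
   # from an instruction's cycle events, using the state characters above.
   def timeline_row(dispatch, issue, writeback, retire, width):
       chars = []
       for cycle in range(width):
           if cycle < dispatch or cycle > retire:
               chars.append('.')   # not yet dispatched, or already retired
           elif cycle == dispatch:
               chars.append('D')   # dispatched
           elif cycle < issue:
               chars.append('=')   # dispatched, waiting to be executed
           elif cycle < writeback:
               chars.append('e')   # executing
           elif cycle == writeback:
               chars.append('E')   # executed (write-back)
           elif cycle < retire:
               chars.append('-')   # executed, waiting to be retired
           else:
               chars.append('R')   # retired
       return ''.join(chars)

   # First vmulps of the dot-product example below: dispatched at cycle 0,
   # issued at cycle 1, write-back at cycle 3, retired at cycle 4.
   print(timeline_row(0, 1, 3, 4, 10))  # DeeER.....

Note that the real tool additionally groups cycle columns with extra spaces for
readability; that cosmetic detail is omitted here.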
+ +Below is the timeline view for a subset of the dot-product example located in +``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by +MCA using the following command: + +.. code-block:: bash + + $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s + +.. code-block:: none + + Timeline view: + 012345 + Index 0123456789 + + [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2 + [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3 + [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 + [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 + [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3 + [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 + [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 + [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3 + [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4 + + + Average Wait times (based on the timeline view): + [0]: Executions + [1]: Average time spent waiting in a scheduler's queue + [2]: Average time spent waiting in a scheduler's queue while ready + [3]: Average time elapsed from WB until retire stage + + [0] [1] [2] [3] + 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2 + 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3 + 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 + +The timeline view is interesting because it shows instruction state changes +during execution. It also gives an idea of how MCA processes instructions +executed on the target, and how their timing information might be calculated. + +The timeline view is structured in two tables. The first table shows +instructions changing state over time (measured in cycles); the second table +(named *Average Wait times*) reports useful timing statistics, which should +help diagnose performance bottlenecks caused by long data dependencies and +sub-optimal usage of hardware resources. 
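
To make the mapping between the two tables concrete, the following Python
sketch recomputes columns [1] and [3] of the *Average Wait times* table from
the per-iteration cycle events that are visible in the timeline. This is an
illustration under simplified assumptions, not code from llvm-mca; column [2]
(time spent waiting while ready) depends on scheduler state that the timeline
does not show, so it is omitted.

.. code-block:: python

   # Hypothetical illustration, not part of llvm-mca: derive columns [1] and
   # [3] of the Average Wait times table from per-iteration cycle events.
   def average_wait_times(events):
       """events: one (dispatch, issue, writeback, retire) tuple per iteration."""
       n = len(events)
       # [1]: average cycles between dispatch and issue (time in the queue).
       in_queue = sum(issue - dispatch for dispatch, issue, _, _ in events) / n
       # [3]: average cycles spent waiting to retire after write-back.
       wait_retire = sum(retire - writeback - 1
                         for _, _, writeback, retire in events) / n
       return round(in_queue, 1), round(wait_retire, 1)

   # The three iterations of vmulps, read off the timeline view above.
   vmulps = [(0, 1, 3, 4), (1, 2, 4, 10), (3, 4, 6, 12)]
   print(average_wait_times(vmulps))  # (1.0, 3.3)

The result matches row 0 of the table: 1.0 cycles in the scheduler's queue and
3.3 cycles, on average, between write-back and retirement.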
+
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence). Since this
+example was generated using 3 iterations (``-iterations=3``), the iteration
+indices range from 0 to 2, inclusive.
+
+Excluding the first and last column, the remaining columns are in cycles.
+Cycles are numbered sequentially starting from 0.
+
+From the example output above, we know the following:
+
+* Instruction [1,0] was dispatched at cycle 1.
+* Instruction [1,0] started executing at cycle 2.
+* Instruction [1,0] reached the write-back stage at cycle 4.
+* Instruction [1,0] was retired at cycle 10.
+
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler's queue for its operands to become available. By the time vmulps is
+dispatched, its operands are already available, and pipeline JFPU1 is ready to
+serve another instruction. So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler's queue.
+
+There is a gap of 5 cycles between the write-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
+
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain. Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+
+In the dot-product example, there are anti-dependencies introduced by
+instructions from different iterations. However, those anti-dependencies can be
+removed at the register renaming stage (at the cost of allocating register
+aliases, and therefore consuming temporary registers).
+
+The *Average Wait times* table helps diagnose performance issues caused by the
+presence of long latency instructions and potentially long data dependencies,
+which may limit the ILP. Note that MCA, by default, assumes at least 1cy
+between the dispatch event and the issue event.
+
+When performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the *ready* state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler's queue. The difference between the two counters is a good indicator
+of how much impact data dependencies had on the execution of the instructions.
+When performance is mostly limited by the lack of hardware resources, the delta
+between the two counters is small. However, the number of cycles spent in the
+queue tends to be larger (i.e., more than 1-3cy), especially when compared to
+other low latency instructions.
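
The heuristic described in the last paragraph can be sketched in a few lines of
Python. This is a hypothetical illustration with made-up thresholds, not logic
taken from llvm-mca; treat it only as a reading aid for the table.

.. code-block:: python

   # Hypothetical illustration: classify the likely bottleneck for an
   # instruction from columns [1] (avg cycles in the scheduler's queue) and
   # [2] (avg cycles spent in the queue while ready). Thresholds are made up.
   def likely_bottleneck(in_queue, ready):
       # A large delta between total queue time and ready time means the
       # instruction mostly waited for its operands to become available.
       # The in_queue > 1.0 guard accounts for the minimum 1cy that MCA
       # assumes between dispatch and issue.
       if in_queue - ready > ready and in_queue > 1.0:
           return "data dependencies / long latency producers"
       # Small delta but long queue time: resources were oversubscribed.
       if in_queue > 3.0:
           return "hardware resource pressure"
       return "no significant stalls"

   # Second vhaddps from the Average Wait times table: [1] = 5.7, [2] = 0.0.
   print(likely_bottleneck(5.7, 0.0))  # data dependencies / long latency producers
   # vmulps: [1] = 1.0, [2] = 1.0.
   print(likely_bottleneck(1.0, 1.0))  # no significant stalls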