Index: llvm/trunk/docs/CommandGuide/llvm-mca.rst =================================================================== --- llvm/trunk/docs/CommandGuide/llvm-mca.rst +++ llvm/trunk/docs/CommandGuide/llvm-mca.rst @@ -21,9 +21,9 @@ when run on the target, but also help with diagnosing potential performance issues. -Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per -Cycle), as well as hardware resource pressure. The analysis and reporting style -were inspired by the IACA tool from Intel. +Given an assembly code sequence, llvm-mca estimates the IPC, as well as +hardware resource pressure. The analysis and reporting style were inspired by +the IACA tool from Intel. :program:`llvm-mca` allows the usage of special code comments to mark regions of the assembly code to be analyzed. A comment starting with substring @@ -207,3 +207,223 @@ :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed to standard error, and the tool returns 1. +HOW MCA WORKS +------------- + +MCA takes assembly code as input. The assembly code is parsed into a sequence +of MCInst with the help of the existing LLVM target assembly parsers. The +parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate +a performance report. + +The Pipeline module simulates the execution of the machine code sequence in a +loop of iterations (default is 100). During this process, the pipeline collects +a number of execution related statistics. At the end of this process, the +pipeline generates and prints a report from the collected statistics. + +Here is an example of a performance report generated by MCA for a dot-product +of two packed float vectors of four elements. The analysis is conducted for +target x86, cpu btver2. The following result can be produced via the following +command using the example located at +``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: + +.. 
code-block:: bash + + $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s + +.. code-block:: none + + Iterations: 300 + Instructions: 900 + Total Cycles: 610 + Dispatch Width: 2 + IPC: 1.48 + Block RThroughput: 2.0 + + + Instruction Info: + [1]: #uOps + [2]: Latency + [3]: RThroughput + [4]: MayLoad + [5]: MayStore + [6]: HasSideEffects (U) + + [1] [2] [3] [4] [5] [6] Instructions: + 1 2 1.00 vmulps %xmm0, %xmm1, %xmm2 + 1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3 + 1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4 + + + Resources: + [0] - JALU0 + [1] - JALU1 + [2] - JDiv + [3] - JFPA + [4] - JFPM + [5] - JFPU0 + [6] - JFPU1 + [7] - JLAGU + [8] - JMul + [9] - JSAGU + [10] - JSTC + [11] - JVALU0 + [12] - JVALU1 + [13] - JVIMUL + + + Resource pressure per iteration: + [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] + - - - 2.00 1.00 2.00 1.00 - - - - - - - + + Resource pressure by instruction: + [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions: + - - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2 + - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3 + - - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4 + +According to this report, the dot-product kernel has been executed 300 times, +for a total of 900 dynamically executed instructions. + +The report is structured in three main sections. The first section collects a +few performance numbers; the goal of this section is to give a very quick +overview of the performance throughput. In this example, the two important +performance indicators are the predicted total number of cycles, and the +Instructions Per Cycle (IPC). IPC is probably the most important throughput +indicator. A big delta between the Dispatch Width and the computed IPC is an +indicator of potential performance issues. + +The second section of the report shows the latency and reciprocal +throughput of every instruction in the sequence. 
That section also reports +extra information related to the number of micro opcodes, and opcode properties +(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects'). + +The third section is the *Resource pressure view*. This view reports +the average number of resource cycles consumed every iteration by instructions +for every processor resource unit available on the target. Information is +structured in two tables. The first table reports the number of resource cycles +spent on average every iteration. The second table correlates the resource +cycles to the machine instruction in the sequence. For example, every iteration +of the instruction vmulps always executes on resource unit [6] +(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle +per iteration. Note that on Jaguar, vector floating-point multiply can only be +issued to pipeline JFPU1, while horizontal floating-point additions can only be +issued to pipeline JFPU0. + +The resource pressure view helps with identifying bottlenecks caused by high +usage of specific hardware resources. Situations with resource pressure mainly +concentrated on a few resources should, in general, be avoided. Ideally, +pressure should be uniformly distributed between multiple resources. + +Timeline View +^^^^^^^^^^^^^ +MCA's timeline view produces a detailed report of each instruction's state +transitions through an instruction pipeline. This view is enabled by the +command line option ``-timeline``. As instructions transition through the +various stages of the pipeline, their states are depicted in the view report. +These states are represented by the following characters: + +* D : Instruction dispatched. +* e : Instruction executing. +* E : Instruction executed. +* R : Instruction retired. +* = : Instruction already dispatched, waiting to be executed. +* \- : Instruction executed, waiting to be retired. 
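
As a rough illustration of how these state characters combine into a row of the
timeline view, here is a small Python sketch. This is a hypothetical helper for
illustration only, not code from the llvm-mca sources; it assumes the
simplified model that an instruction dispatches, issues, writes back, and
retires at single known cycles.

.. code-block:: python

   # Hypothetical helper, not part of llvm-mca: render a single timeline row
   # from an instruction's cycle events, using the state characters above.
   def timeline_row(dispatch, issue, writeback, retire, width):
       chars = []
       for cycle in range(width):
           if cycle < dispatch or cycle > retire:
               chars.append('.')   # not yet dispatched, or already retired
           elif cycle == dispatch:
               chars.append('D')   # dispatched
           elif cycle < issue:
               chars.append('=')   # dispatched, waiting to be executed
           elif cycle < writeback:
               chars.append('e')   # executing
           elif cycle == writeback:
               chars.append('E')   # executed (write-back)
           elif cycle < retire:
               chars.append('-')   # executed, waiting to be retired
           else:
               chars.append('R')   # retired
       return ''.join(chars)

   # First vmulps of the dot-product example below: dispatched at cycle 0,
   # issued at cycle 1, write-back at cycle 3, retired at cycle 4.
   print(timeline_row(0, 1, 3, 4, 10))  # DeeER.....

Note that the real tool additionally groups cycle columns with extra spaces for
readability; that cosmetic detail is omitted here.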
+ +Below is the timeline view for a subset of the dot-product example located in +``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by +MCA using the following command: + +.. code-block:: bash + + $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s + +.. code-block:: none + + Timeline view: + 012345 + Index 0123456789 + + [0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2 + [0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3 + [0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 + [1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 + [1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3 + [1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4 + [2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2 + [2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3 + [2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4 + + + Average Wait times (based on the timeline view): + [0]: Executions + [1]: Average time spent waiting in a scheduler's queue + [2]: Average time spent waiting in a scheduler's queue while ready + [3]: Average time elapsed from WB until retire stage + + [0] [1] [2] [3] + 0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2 + 1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3 + 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 + +The timeline view is interesting because it shows instruction state changes +during execution. It also gives an idea of how MCA processes instructions +executed on the target, and how their timing information might be calculated. + +The timeline view is structured in two tables. The first table shows +instructions changing state over time (measured in cycles); the second table +(named *Average Wait times*) reports useful timing statistics, which should +help diagnose performance bottlenecks caused by long data dependencies and +sub-optimal usage of hardware resources. 
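
To make the mapping between the two tables concrete, the following Python
sketch recomputes columns [1] and [3] of the *Average Wait times* table from
the per-iteration cycle events that are visible in the timeline. This is an
illustration under simplified assumptions, not code from llvm-mca; column [2]
(time spent waiting while ready) depends on scheduler state that the timeline
does not show, so it is omitted.

.. code-block:: python

   # Hypothetical illustration, not part of llvm-mca: derive columns [1] and
   # [3] of the Average Wait times table from per-iteration cycle events.
   def average_wait_times(events):
       """events: one (dispatch, issue, writeback, retire) tuple per iteration."""
       n = len(events)
       # [1]: average cycles between dispatch and issue (time in the queue).
       in_queue = sum(issue - dispatch for dispatch, issue, _, _ in events) / n
       # [3]: average cycles spent waiting to retire after write-back.
       wait_retire = sum(retire - writeback - 1
                         for _, _, writeback, retire in events) / n
       return round(in_queue, 1), round(wait_retire, 1)

   # The three iterations of vmulps, read off the timeline view above.
   vmulps = [(0, 1, 3, 4), (1, 2, 4, 10), (3, 4, 6, 12)]
   print(average_wait_times(vmulps))  # (1.0, 3.3)

The result matches row 0 of the table: 1.0 cycles in the scheduler's queue and
3.3 cycles, on average, between write-back and retirement.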
+
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence). Since this
+example was generated using 3 iterations (``-iterations=3``), the iteration
+indices range from 0 to 2, inclusive.
+
+Excluding the first and last column, the remaining columns are in cycles.
+Cycles are numbered sequentially starting from 0.
+
+From the example output above, we know the following:
+
+* Instruction [1,0] was dispatched at cycle 1.
+* Instruction [1,0] started executing at cycle 2.
+* Instruction [1,0] reached the write-back stage at cycle 4.
+* Instruction [1,0] was retired at cycle 10.
+
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler's queue for its operands to become available. By the time vmulps is
+dispatched, its operands are already available, and pipeline JFPU1 is ready to
+serve another instruction. So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler's queue.
+
+There is a gap of 5 cycles between the write-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
+
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain. Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+
+In the dot-product example, there are anti-dependencies introduced by
+instructions from different iterations. However, those anti-dependencies can be
+removed at the register renaming stage (at the cost of allocating register
+aliases, and therefore consuming temporary registers).
+
+The *Average Wait times* table helps diagnose performance issues caused by the
+presence of long latency instructions and potentially long data dependencies,
+which may limit the ILP. Note that MCA, by default, assumes at least 1cy
+between the dispatch event and the issue event.
+
+When performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the *ready* state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler's queue. The difference between the two counters is a good indicator
+of how much impact data dependencies had on the execution of the instructions.
+When performance is mostly limited by the lack of hardware resources, the delta
+between the two counters is small. However, the number of cycles spent in the
+queue tends to be larger (i.e., more than 1-3cy), especially when compared to
+other low latency instructions.
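
The heuristic described in the last paragraph can be sketched in a few lines of
Python. This is a hypothetical illustration with made-up thresholds, not logic
taken from llvm-mca; treat it only as a reading aid for the table.

.. code-block:: python

   # Hypothetical illustration: classify the likely bottleneck for an
   # instruction from columns [1] (avg cycles in the scheduler's queue) and
   # [2] (avg cycles spent in the queue while ready). Thresholds are made up.
   def likely_bottleneck(in_queue, ready):
       # A large delta between total queue time and ready time means the
       # instruction mostly waited for its operands to become available.
       # The in_queue > 1.0 guard accounts for the minimum 1cy that MCA
       # assumes between dispatch and issue.
       if in_queue - ready > ready and in_queue > 1.0:
           return "data dependencies / long latency producers"
       # Small delta but long queue time: resources were oversubscribed.
       if in_queue > 3.0:
           return "hardware resource pressure"
       return "no significant stalls"

   # Second vhaddps from the Average Wait times table: [1] = 5.7, [2] = 0.0.
   print(likely_bottleneck(5.7, 0.0))  # data dependencies / long latency producers
   # vmulps: [1] = 1.0, [2] = 1.0.
   print(likely_bottleneck(1.0, 1.0))  # no significant stalls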