Index: docs/CommandGuide/llvm-mca.rst
===================================================================
--- docs/CommandGuide/llvm-mca.rst
+++ docs/CommandGuide/llvm-mca.rst
@@ -207,3 +207,121 @@
 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 to standard error, and the tool returns 1.
 
+VIEW DESCRIPTIONS
+-----------------
+This section describes the various views that :program:`llvm-mca` can
+generate.
+
+Timeline View
+^^^^^^^^^^^^^
+MCA's timeline view produces a detailed report of each instruction's state
+transitions through an instruction pipeline.  This view is enabled by the
+command line option ``-timeline``.  As instructions transition through the
+various stages of the pipeline, their states are depicted in the view report.
+These states are represented by the following characters:
+
+* D : Instruction dispatched.
+* e : Instruction executing.
+* E : Instruction executed.
+* R : Instruction retired.
+* = : Instruction already dispatched, waiting to be executed.
+* \- : Instruction executed, waiting to be retired.
+
+Below is the timeline view for a subset of the dot-product example located in
+``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
+MCA using the following command:
+
+.. code-block:: bash
+
+  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
+
+.. code-block:: none
+
+  Timeline view:
+                      012345
+  Index     0123456789
+
+  [0,0]     DeeER.    .    .   vmulps	%xmm0, %xmm1, %xmm2
+  [0,1]     D==eeeER  .    .   vhaddps	%xmm2, %xmm2, %xmm3
+  [0,2]     .D====eeeER    .   vhaddps	%xmm3, %xmm3, %xmm4
+  [1,0]     .DeeE-----R    .   vmulps	%xmm0, %xmm1, %xmm2
+  [1,1]     . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3
+  [1,2]     . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4
+  [2,0]     .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2
+  [2,1]     .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3
+  [2,2]     .   D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4
+
+
+  Average Wait times (based on the timeline view):
+  [0]: Executions
+  [1]: Average time spent waiting in a scheduler's queue
+  [2]: Average time spent waiting in a scheduler's queue while ready
+  [3]: Average time elapsed from WB until retire stage
+
+        [0]    [1]    [2]    [3]
+  0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
+  1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
+  2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
+
+The timeline view shows how each instruction changes state during execution.
+It also gives an idea of how MCA processes instructions executed on the target,
+and how their timing information might be calculated.
+
+The timeline view is structured in two tables.  The first table shows
+instructions changing state over time (measured in cycles); the second table
+(named *Average Wait times*) reports useful timing statistics, which should
+help diagnose performance bottlenecks caused by long data dependencies and
+sub-optimal usage of hardware resources.
+
+An instruction in the timeline view is identified by a pair of indices, where
+the first index identifies an iteration, and the second index is the
+instruction index (i.e., where it appears in the code sequence).  Because this
+example was generated with ``-iterations=3``, the iteration indices range from
+0 to 2 inclusive.
+
+Excluding the first and last columns, each remaining column represents one
+cycle.  Cycles are numbered sequentially starting from 0.
+
+From the example output above, we know the following:
+
+* Instruction [1,0] was dispatched at cycle 1.
+* Instruction [1,0] started executing at cycle 2.
+* Instruction [1,0] reached the write-back stage at cycle 4.
+* Instruction [1,0] was retired at cycle 10.
+
+Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
+scheduler's queue for the operands to become available. By the time vmulps is
+dispatched, operands are already available, and pipeline JFPU1 is ready to
+serve another instruction.  So the instruction can be immediately issued on the
+JFPU1 pipeline. That is demonstrated by the fact that the instruction only
+spent 1cy in the scheduler's queue.
+
+There is a gap of 5 cycles between the write-back stage and the retire event.
+That is because instructions must retire in program order, so [1,0] has to wait
+for [0,2] to be retired first (i.e., it has to wait until cycle 10).
+
+In the example, all instructions are in a RAW (Read After Write) dependency
+chain.  Register %xmm2 written by vmulps is immediately used by the first
+vhaddps, and register %xmm3 written by the first vhaddps is used by the second
+vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
+Parallelism).
+
+In the dot-product example, there are anti-dependencies introduced by
+instructions from different iterations.  However, those dependencies can be
+removed at the register renaming stage (at the cost of allocating register
+aliases, and therefore consuming temporary registers).
+
+The *Average Wait times* table helps diagnose performance issues caused by the
+presence of long latency instructions and potentially long data dependencies
+that may limit the ILP.  Note that MCA, by default, assumes at least 1cy
+between the dispatch event and the issue event.
+
+When the performance is limited by data dependencies and/or long latency
+instructions, the number of cycles spent while in the *ready* state is expected
+to be very small when compared with the total number of cycles spent in the
+scheduler's queue.  The difference between the two counters is a good indicator
+of how large an impact data dependencies had on the execution of the
+instructions.  When performance is mostly limited by the lack of hardware
+resources, the delta between the two counters is small.  However, the number of
+cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
+especially when compared to other low latency instructions.
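
The relationship between the timeline rows and the *Average Wait times* table
in the documentation added above can be sketched in Python. This is only an
illustration, not part of the patch and not how llvm-mca computes its reports.
It assumes that the time spent in the scheduler's queue (column [1]) equals the
number of ``=`` characters plus the default 1cy between dispatch and issue, and
that the write-back-to-retire time (column [3]) equals the number of ``-``
characters. Column [2] cannot be recovered from the timeline characters alone,
so it is left out. The names ``TIMELINE``, ``event_cycles`` and
``average_waits`` are hypothetical.

```python
from collections import defaultdict

# Rows copied from the example timeline: (iteration, instruction index, states).
# Each character position in the state string is one cycle, starting at 0.
TIMELINE = [
    (0, 0, "DeeER"),
    (0, 1, "D==eeeER"),
    (0, 2, ".D====eeeER"),
    (1, 0, ".DeeE-----R"),
    (1, 1, ". D=eeeE---R"),
    (1, 2, ". D====eeeER"),
    (2, 0, ".  DeeE-----R"),
    (2, 1, ".  D====eeeER"),
    (2, 2, ".   D======eeeER"),
]


def event_cycles(states):
    """Return the cycle at which each pipeline event first appears in a row."""
    return {
        "dispatched": states.index("D"),
        "executing": states.index("e"),
        "executed": states.index("E"),
        "retired": states.index("R"),
    }


def average_waits(rows):
    """Average queue time and write-back-to-retire time per instruction index.

    Assumption: queue time = count('=') + 1 (the 1cy dispatch-to-issue
    minimum), retire wait = count('-').
    """
    queue = defaultdict(list)
    retire = defaultdict(list)
    for _iteration, index, states in rows:
        queue[index].append(states.count("=") + 1)
        retire[index].append(states.count("-"))
    return {
        index: (
            round(sum(queue[index]) / len(queue[index]), 1),
            round(sum(retire[index]) / len(retire[index]), 1),
        )
        for index in sorted(queue)
    }


if __name__ == "__main__":
    print(event_cycles(".DeeE-----R"))
    for index, (in_queue, to_retire) in average_waits(TIMELINE).items():
        print(f"{index}.  [1]={in_queue}  [3]={to_retire}")
```

Under these assumptions the sketch gives the same cycles as the bullet list for
instruction [1,0] and the same [1] and [3] values as the example table.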