diff --git a/llvm/docs/CommandGuide/llvm-exegesis.rst b/llvm/docs/CommandGuide/llvm-exegesis.rst
--- a/llvm/docs/CommandGuide/llvm-exegesis.rst
+++ b/llvm/docs/CommandGuide/llvm-exegesis.rst
@@ -89,18 +89,18 @@

 .. code-block:: bash

-    $ llvm-exegesis -mode=latency -opcode-name=ADD64rr
+    $ llvm-exegesis --mode=latency --opcode-name=ADD64rr

 Measuring the uop decomposition or inverse throughput of an instruction works similarly:

 .. code-block:: bash

-    $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
-    $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr
+    $ llvm-exegesis --mode=uops --opcode-name=ADD64rr
+    $ llvm-exegesis --mode=inverse_throughput --opcode-name=ADD64rr


 The output is a YAML document (the default is to write to stdout, but you can
-redirect the output to a file using `-benchmarks-file`):
+redirect the output to a file using `--benchmarks-file`):

 .. code-block:: none

@@ -125,7 +125,7 @@

 .. code-block:: bash

-    $ llvm-exegesis -mode=latency -opcode-index=-1
+    $ llvm-exegesis --mode=latency --opcode-index=-1


 EXAMPLE 2: benchmarking a custom code snippet
@@ -136,7 +136,7 @@

 .. code-block:: bash

-    $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-
+    $ echo "vzeroupper" | llvm-exegesis --mode=uops --snippets-file=-

 Real-life code snippets typically depend on registers or memory.
 :program:`llvm-exegesis` checks the liveliness of registers (i.e. any register
@@ -189,10 +189,10 @@

 .. code-block:: bash

-    $ llvm-exegesis -mode=analysis \
-  -benchmarks-file=/tmp/benchmarks.yaml \
-  -analysis-clusters-output-file=/tmp/clusters.csv \
-  -analysis-inconsistencies-output-file=/tmp/inconsistencies.html
+    $ llvm-exegesis --mode=analysis \
+  --benchmarks-file=/tmp/benchmarks.yaml \
+  --analysis-clusters-output-file=/tmp/clusters.csv \
+  --analysis-inconsistencies-output-file=/tmp/inconsistencies.html

 This will group the instructions into clusters with the same performance
 characteristics. The clusters will be written out to `/tmp/clusters.csv` in the
@@ -230,28 +230,28 @@
 OPTIONS
 -------

-.. option:: -help
+.. option:: --help

  Print a summary of command line options.

-.. option:: -opcode-index=
+.. option:: --opcode-index=

  Specify the opcode to measure, by index. Specifying `-1` will result
  in measuring every existing opcode. See example 1 for details.
  Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

-.. option:: -opcode-name=,,...
+.. option:: --opcode-name=,,...

  Specify the opcode to measure, by name. Several opcodes can be specified as
  a comma-separated list. See example 1 for details.
  Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

-.. option:: -snippets-file=
+.. option:: --snippets-file=

  Specify the custom code snippet to measure. See example 2 for details.
  Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

-.. option:: -mode=[latency|uops|inverse_throughput|analysis]
+.. option:: --mode=[latency|uops|inverse_throughput|analysis]

  Specify the run mode. Note that some modes have additional requirements and options.

@@ -274,7 +274,7 @@
   * ``assemble-measured-code``: Same as ``prepare-and-assemble-snippet``. but also creates the full sequence that can be dumped to a file using ``--dump-object-to-disk``.
   * ``measure``: Same as ``assemble-measured-code``, but also runs the measurement.

-.. option:: -x86-lbr-sample-period=
+.. option:: --x86-lbr-sample-period=

  Specify the LBR sampling period - how many branches before we take a sample.
  When a positive value is specified for this option and when the mode is `latency`,
@@ -283,7 +283,7 @@
  could occur if the sampling is too frequent. A prime number should be used to
  avoid consistently skipping certain blocks.

-.. option:: -x86-disable-upper-sse-registers
+.. option:: --x86-disable-upper-sse-registers

  Using the upper xmm registers (xmm8-xmm15) forces a longer instruction encoding
  which may put greater pressure on the frontend fetch and decode stages,
@@ -292,7 +292,7 @@
  enabled can help determine the effects of the frontend and can be used to
  improve latency and throughput estimates.

-.. option:: -repetition-mode=[duplicate|loop|min]
+.. option:: --repetition-mode=[duplicate|loop|min]

  Specify the repetition mode. `duplicate` will create a large, straight line
  basic block with `num-repetitions` instructions (repeating the snippet
@@ -307,13 +307,13 @@
  instead use the `min` mode, which will run each other mode,
  and produce the minimal measured result.

-.. option:: -num-repetitions=
+.. option:: --num-repetitions=

  Specify the target number of executed instructions. Note that the actual
  repetition count of the snippet will be `num-repetitions`/`snippet size`.
  Higher values lead to more accurate measurements but lengthen the benchmark.

-.. option:: -loop-body-size=
+.. option:: --loop-body-size=

  Only effective for `-repetition-mode=[loop|min]`.
  Instead of looping over the snippet directly, first duplicate it so that the
@@ -321,7 +321,7 @@
  in loop body being cached in the CPU Op Cache / Loop Cache, which allows to
  which may have higher throughput than the CPU decoders.

-.. option:: -max-configs-per-opcode=
+.. option:: --max-configs-per-opcode=

  Specify the maximum configurations that can be generated for each opcode.
  By default this is `1`, meaning that we assume that a single measurement is
@@ -333,22 +333,22 @@
  lead to different performance characteristics.


-.. option:: -benchmarks-file=
+.. option:: --benchmarks-file=

  File to read (`analysis` mode) or write (`latency`/`uops`/`inverse_throughput`
  modes) benchmark results. "-" uses stdin/stdout.

-.. option:: -analysis-clusters-output-file=
+.. option:: --analysis-clusters-output-file=

  If provided, write the analysis clusters as CSV to this file. "-" prints to
  stdout. By default, this analysis is not run.

-.. option:: -analysis-inconsistencies-output-file=
+.. option:: --analysis-inconsistencies-output-file=

  If non-empty, write inconsistencies found during analysis to this file. `-`
  prints to stdout. By default, this analysis is not run.

-.. option:: -analysis-filter=[all|reg-only|mem-only]
+.. option:: --analysis-filter=[all|reg-only|mem-only]

  By default, all benchmark results are analysed, but sometimes it may be useful
  to only look at those that to not involve memory, or vice versa. This option
@@ -356,44 +356,44 @@
  ones that do involve memory (involve instructions that may read or write to
  memory), or the opposite, to only keep such benchmarks.

-.. option:: -analysis-clustering=[dbscan,naive]
+.. option:: --analysis-clustering=[dbscan,naive]

  Specify the clustering algorithm to use. By default DBSCAN will be used.
  Naive clustering algorithm is better for doing further work on the
  `-analysis-inconsistencies-output-file=` output, it will create one cluster
  per opcode, and check that the cluster is stable (all points are neighbours).

-.. option:: -analysis-numpoints=
+.. option:: --analysis-numpoints=

  Specify the numPoints parameters to be used for DBSCAN clustering
  (`analysis` mode, DBSCAN only).

-.. option:: -analysis-clustering-epsilon=
+.. option:: --analysis-clustering-epsilon=

  Specify the epsilon parameter used for clustering of benchmark points
  (`analysis` mode).

-.. option:: -analysis-inconsistency-epsilon=
+.. option:: --analysis-inconsistency-epsilon=

  Specify the epsilon parameter used for detection of when the cluster
  is different from the LLVM schedule profile values (`analysis` mode).

-.. option:: -analysis-display-unstable-clusters
+.. option:: --analysis-display-unstable-clusters

  If there is more than one benchmark for an opcode, said benchmarks may end up
  not being clustered into the same cluster if the measured performance
  characteristics are different. by default all such opcodes are filtered out.
  This flag will instead show only such unstable opcodes.

-.. option:: -ignore-invalid-sched-class=false
+.. option:: --ignore-invalid-sched-class=false

  If set, ignore instructions that do not have a sched class (class idx = 0).

-.. option:: -mtriple=
+.. option:: --mtriple=

  Target triple. See `-version` for available targets.

-.. option:: -mcpu=
+.. option:: --mcpu=

  If set, measure the cpu characteristics using the counters for this CPU. This
  is useful when creating new sched models (the host CPU is unknown to LLVM).