This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] run LSR before straight-line optimizations
Closed (Public)

Authored by jingyue on Jul 17 2015, 11:03 AM.

Details

Reviewers

jholewinski
eliben

Commits

rG6a3fdeca224e: [NVPTX] run LSR before straight-line optimizations
rL242982: [NVPTX] run LSR before straight-line optimizations

Summary

Straight-line optimizations can simplify the loop body and make LSR's
cost analysis more precise. This significantly improves several Eigen3
CUDA benchmarks.

With this change, EigenContractionKernel runs up to 40% faster
(https://bitbucket.org/eigen/eigen/src/753ceee5f206ff7dde9f6a41a5a420749fc9406f/unsupported/Eigen/CXX11/src/Tensor/TensorContractionCuda.h?at=default#cl-502).
EigenConvolutionKernel2D runs up to 10% faster
(https://bitbucket.org/eigen/eigen/src/753ceee5f206ff7dde9f6a41a5a420749fc9406f/unsupported/Eigen/CXX11/src/Tensor/TensorConvolution.h?at=default#cl-605).

I have had difficulty writing small tests that benefit from this
reordering because of what appears to be an issue with LSR (being discussed at
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-July/088244.html).

See the review thread for the compilation time impact of GVN.
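
As a rough picture of the ordering in the committed diff, the sketch below models the relevant part of the NVPTX IR pipeline as a list of pass names. It is a toy model and not LLVM code: `OptLevel`, `earlyCSEOrGVN`, and `buildIRPipeline` are hypothetical names, and the entry "GenericIRPasses(LSR)" is only a label standing in for the call to TargetPassConfig::addIRPasses().

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the codegen optimization level; only the
// distinction between "aggressive" (-O3) and everything else matters here.
enum class OptLevel { Default, Aggressive };

// Mirrors the intent of addEarlyCSEOrGVNPass() in the diff: GVN at the
// aggressive level, EarlyCSE otherwise.
std::string earlyCSEOrGVN(OptLevel level) {
  return level == OptLevel::Aggressive ? "GVN" : "EarlyCSE";
}

// Returns pass names in the order the patched addIRPasses() adds them,
// restricted to the passes this change touches.
std::vector<std::string> buildIRPipeline(OptLevel level) {
  std::vector<std::string> passes;
  // Straight-line scalar optimizations come first.
  passes.push_back("SeparateConstOffsetFromGEP");
  passes.push_back("SpeculativeExecution");
  passes.push_back("StraightLineStrengthReduce");
  passes.push_back(earlyCSEOrGVN(level));
  passes.push_back("NaryReassociate");
  passes.push_back("EarlyCSE");
  // Then the generic IR passes, which include LSR (label is illustrative).
  passes.push_back("GenericIRPasses(LSR)");
  // Finally, clean up what LSR produced.
  passes.push_back(earlyCSEOrGVN(level));
  return passes;
}
```

Calling `buildIRPipeline(OptLevel::Aggressive)` yields a list in which GVN appears both before NaryReassociate and as the last entry, while any other level substitutes EarlyCSE in both places.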

Diff Detail

Repository: rL LLVM

Event Timeline

jingyue updated this revision to Diff 30015. Jul 17 2015, 11:03 AM

jingyue retitled this revision to [NVPTX] run LSR before straight-line optimizations.

jingyue updated this object.

jingyue added reviewers: jholewinski, eliben.

jingyue added a subscriber: llvm-commits.

Herald added a subscriber: jholewinski. Jul 17 2015, 11:03 AM

Looks reasonable to me.

What is the impact on compile time of adding this extra GVN pass?

Below is the compilation time breakdown of running "opt -O3" and "llc" on one of our GPU programs, which compiles to more than 100k lines of PTX.

The extra GVN run takes 2.6% of the total time. The report lists three GVN runs: the first (4.8%) happens in the target-independent stage, and the other two happen in NVPTX's private pipeline.

I'll add a check to enable it for -O3 only.

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 8.3537 seconds (8.3467 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9868 ( 11.9%)   0.0161 ( 32.1%)   1.0029 ( 12.0%)   1.0048 ( 12.0%)  NVPTX DAG->DAG Pattern Instruction Selection
   0.9987 ( 12.0%)   0.0000 (  0.0%)   0.9987 ( 12.0%)   0.9995 ( 12.0%)  Straight line strength reduction
   0.4514 (  5.4%)   0.0000 (  0.0%)   0.4514 (  5.4%)   0.4490 (  5.4%)  Function Integration/Inlining
   0.4348 (  5.2%)   0.0000 (  0.0%)   0.4348 (  5.2%)   0.4354 (  5.2%)  Nary reassociation
   0.4033 (  4.9%)   0.0000 (  0.0%)   0.4033 (  4.8%)   0.4002 (  4.8%)  Global Value Numbering
   0.2823 (  3.4%)   0.0001 (  0.1%)   0.2824 (  3.4%)   0.2780 (  3.3%)  Combine redundant instructions
   0.2696 (  3.2%)   0.0002 (  0.3%)   0.2697 (  3.2%)   0.2647 (  3.2%)  Combine redundant instructions
   0.2423 (  2.9%)   0.0000 (  0.0%)   0.2423 (  2.9%)   0.2387 (  2.9%)  Combine redundant instructions
   0.2328 (  2.8%)   0.0000 (  0.1%)   0.2328 (  2.8%)   0.2291 (  2.7%)  Combine redundant instructions
   0.2232 (  2.7%)   0.0000 (  0.0%)   0.2232 (  2.7%)   0.2223 (  2.7%)  Global Value Numbering
   0.2123 (  2.6%)   0.0001 (  0.1%)   0.2124 (  2.5%)   0.2161 (  2.6%)  Global Value Numbering
   0.2000 (  2.4%)   0.0002 (  0.4%)   0.2001 (  2.4%)   0.1944 (  2.3%)  Loop Invariant Code Motion
   0.1929 (  2.3%)   0.0001 (  0.3%)   0.1931 (  2.3%)   0.1927 (  2.3%)  Combine redundant instructions
   0.1928 (  2.3%)   0.0000 (  0.0%)   0.1928 (  2.3%)   0.1924 (  2.3%)  Combine redundant instructions
   0.1907 (  2.3%)   0.0000 (  0.0%)   0.1907 (  2.3%)   0.1919 (  2.3%)  Value Propagation
   0.1759 (  2.1%)   0.0006 (  1.1%)   0.1764 (  2.1%)   0.1715 (  2.1%)  Induction Variable Simplification
   0.1735 (  2.1%)   0.0001 (  0.1%)   0.1736 (  2.1%)   0.1714 (  2.1%)  Loop Invariant Code Motion
   0.1671 (  2.0%)   0.0010 (  2.0%)   0.1682 (  2.0%)   0.1714 (  2.1%)  Combine redundant instructions
   0.1426 (  1.7%)   0.0000 (  0.0%)   0.1426 (  1.7%)   0.1415 (  1.7%)  Loop Invariant Code Motion
   0.1302 (  1.6%)   0.0001 (  0.2%)   0.1304 (  1.6%)   0.1302 (  1.6%)  Loop Strength Reduction
   0.1226 (  1.5%)   0.0000 (  0.0%)   0.1226 (  1.5%)   0.1287 (  1.5%)  Unroll loops
   0.1248 (  1.5%)   0.0002 (  0.4%)   0.1251 (  1.5%)   0.1269 (  1.5%)  SROA
   0.0994 (  1.2%)   0.0000 (  0.0%)   0.0994 (  1.2%)   0.0988 (  1.2%)  Value Propagation
   0.0991 (  1.2%)   0.0000 (  0.0%)   0.0991 (  1.2%)   0.0979 (  1.2%)  Combine redundant instructions
   0.0748 (  0.9%)   0.0039 (  7.8%)   0.0787 (  0.9%)   0.0752 (  0.9%)  Simple Register Coalescing
   0.0659 (  0.8%)   0.0000 (  0.0%)   0.0659 (  0.8%)   0.0664 (  0.8%)  Induction Variable Users
   0.0659 (  0.8%)   0.0038 (  7.6%)   0.0697 (  0.8%)   0.0633 (  0.8%)  Early CSE
   0.0639 (  0.8%)   0.0000 (  0.0%)   0.0639 (  0.8%)   0.0631 (  0.8%)  Sparse Conditional Constant Propagation
   0.0632 (  0.8%)   0.0000 (  0.1%)   0.0632 (  0.8%)   0.0625 (  0.7%)  Early CSE
   0.0579 (  0.7%)   0.0000 (  0.0%)   0.0579 (  0.7%)   0.0580 (  0.7%)  NVPTX Assembly Printer
   0.0565 (  0.7%)   0.0000 (  0.0%)   0.0565 (  0.7%)   0.0566 (  0.7%)  CodeGen Prepare
   0.0564 (  0.7%)   0.0000 (  0.0%)   0.0564 (  0.7%)   0.0553 (  0.7%)  Live Interval Analysis
   0.0574 (  0.7%)   0.0001 (  0.2%)   0.0575 (  0.7%)   0.0530 (  0.6%)  Early CSE
   0.0506 (  0.6%)   0.0000 (  0.1%)   0.0506 (  0.6%)   0.0468 (  0.6%)  Dead Store Elimination
   0.0437 (  0.5%)   0.0000 (  0.0%)   0.0437 (  0.5%)   0.0436 (  0.5%)  Dead Code Elimination
   0.0438 (  0.5%)   0.0002 (  0.3%)   0.0439 (  0.5%)   0.0424 (  0.5%)  Bit-Tracking Dead Code Elimination
   0.0428 (  0.5%)   0.0000 (  0.0%)   0.0428 (  0.5%)   0.0419 (  0.5%)  Machine Loop Invariant Code Motion
   0.0415 (  0.5%)   0.0000 (  0.0%)   0.0415 (  0.5%)   0.0405 (  0.5%)  Module Verifier
   0.0413 (  0.5%)   0.0000 (  0.0%)   0.0413 (  0.5%)   0.0401 (  0.5%)  SROA
   0.0367 (  0.4%)   0.0000 (  0.0%)   0.0367 (  0.4%)   0.0373 (  0.4%)  Machine Common Subexpression Elimination
   0.0352 (  0.4%)   0.0000 (  0.0%)   0.0352 (  0.4%)   0.0351 (  0.4%)  Interprocedural Sparse Conditional Constant Propagation
   0.0339 (  0.4%)   0.0000 (  0.0%)   0.0339 (  0.4%)   0.0342 (  0.4%)  Module Verifier
   0.0349 (  0.4%)   0.0001 (  0.2%)   0.0350 (  0.4%)   0.0318 (  0.4%)  Simplify the CFG
   0.0304 (  0.4%)   0.0000 (  0.0%)   0.0304 (  0.4%)   0.0299 (  0.4%)  Live Variable Analysis
   0.0278 (  0.3%)   0.0000 (  0.0%)   0.0278 (  0.3%)   0.0274 (  0.3%)  Reassociate expressions
   0.0263 (  0.3%)   0.0039 (  7.7%)   0.0301 (  0.4%)   0.0260 (  0.3%)  Aggressive Dead Code Elimination
   0.0215 (  0.3%)   0.0000 (  0.0%)   0.0215 (  0.3%)   0.0228 (  0.3%)  Jump Threading
   0.0211 (  0.3%)   0.0000 (  0.0%)   0.0211 (  0.3%)   0.0205 (  0.2%)  Split GEPs to a variadic base and a constant offset for better CSE
   0.0193 (  0.2%)   0.0000 (  0.0%)   0.0193 (  0.2%)   0.0200 (  0.2%)  Simplify the CFG
   0.0131 (  0.2%)   0.0038 (  7.6%)   0.0169 (  0.2%)   0.0175 (  0.2%)  Unnamed pass: implement Pass::getPassName()
   0.0169 (  0.2%)   0.0000 (  0.0%)   0.0170 (  0.2%)   0.0174 (  0.2%)  Rotate Loops
   0.0166 (  0.2%)   0.0000 (  0.0%)   0.0166 (  0.2%)   0.0170 (  0.2%)  convert address space of alloca'ed memory to local
   0.0148 (  0.2%)   0.0039 (  7.8%)   0.0188 (  0.2%)   0.0168 (  0.2%)  Lower aggregate copies/intrinsics into loops
   0.0168 (  0.2%)   0.0000 (  0.0%)   0.0168 (  0.2%)   0.0164 (  0.2%)  Machine code sinking
   0.0171 (  0.2%)   0.0000 (  0.0%)   0.0171 (  0.2%)   0.0160 (  0.2%)  Unroll loops
   0.0156 (  0.2%)   0.0000 (  0.0%)   0.0156 (  0.2%)   0.0160 (  0.2%)  Simplify the CFG
   0.0107 (  0.1%)   0.0000 (  0.0%)   0.0107 (  0.1%)   0.0154 (  0.2%)  Recognize loop idioms
   0.0141 (  0.2%)   0.0000 (  0.0%)   0.0141 (  0.2%)   0.0141 (  0.2%)  Eliminate PHI nodes for register allocation
   0.0127 (  0.2%)   0.0000 (  0.1%)   0.0128 (  0.2%)   0.0127 (  0.2%)  Unnamed pass: implement Pass::getPassName()
   0.0099 (  0.1%)   0.0000 (  0.0%)   0.0099 (  0.1%)   0.0111 (  0.1%)  Jump Threading
   0.0111 (  0.1%)   0.0000 (  0.0%)   0.0111 (  0.1%)   0.0111 (  0.1%)  Remove unused exception handling info
   0.0105 (  0.1%)   0.0000 (  0.0%)   0.0106 (  0.1%)   0.0107 (  0.1%)  Simplify the CFG
   0.0101 (  0.1%)   0.0000 (  0.0%)   0.0101 (  0.1%)   0.0102 (  0.1%)  Simplify the CFG
   0.0095 (  0.1%)   0.0000 (  0.0%)   0.0095 (  0.1%)   0.0098 (  0.1%)  Float to int
   0.0066 (  0.1%)   0.0000 (  0.0%)   0.0066 (  0.1%)   0.0095 (  0.1%)  Tail Call Elimination
   0.0094 (  0.1%)   0.0000 (  0.0%)   0.0094 (  0.1%)   0.0094 (  0.1%)  Dead Global Elimination
   0.0051 (  0.1%)   0.0001 (  0.2%)   0.0052 (  0.1%)   0.0088 (  0.1%)  Simplify the CFG
   0.0060 (  0.1%)   0.0000 (  0.0%)   0.0060 (  0.1%)   0.0084 (  0.1%)  Loop-Closed SSA Form Pass
   0.0079 (  0.1%)   0.0000 (  0.0%)   0.0079 (  0.1%)   0.0082 (  0.1%)  Promote 'by reference' arguments to scalars
   0.0074 (  0.1%)   0.0000 (  0.0%)   0.0074 (  0.1%)   0.0074 (  0.1%)  Two-Address instruction pass
   0.0068 (  0.1%)   0.0000 (  0.0%)   0.0068 (  0.1%)   0.0073 (  0.1%)  Unswitch loops
   0.0033 (  0.0%)   0.0001 (  0.1%)   0.0034 (  0.0%)   0.0072 (  0.1%)  Dominator Tree Construction
   0.0031 (  0.0%)   0.0000 (  0.0%)   0.0031 (  0.0%)   0.0068 (  0.1%)  Deduce function attributes
   0.0048 (  0.1%)   0.0000 (  0.0%)   0.0048 (  0.1%)   0.0063 (  0.1%)  Lazy Value Information Analysis
   0.0040 (  0.0%)   0.0000 (  0.0%)   0.0040 (  0.0%)   0.0057 (  0.1%)  MemCpy Optimization
   0.0062 (  0.1%)   0.0000 (  0.0%)   0.0062 (  0.1%)   0.0055 (  0.1%)  Remove unnecessary non-generic-to-generic addrspacecasts
   0.0049 (  0.1%)   0.0000 (  0.0%)   0.0049 (  0.1%)   0.0052 (  0.1%)  SROA
   0.0054 (  0.1%)   0.0000 (  0.0%)   0.0054 (  0.1%)   0.0052 (  0.1%)  Peephole Optimizations
   0.0049 (  0.1%)   0.0038 (  7.6%)   0.0088 (  0.1%)   0.0050 (  0.1%)  Loop-Closed SSA Form Pass
   0.0042 (  0.1%)   0.0000 (  0.0%)   0.0042 (  0.1%)   0.0050 (  0.1%)  Slot index numbering
   0.0046 (  0.1%)   0.0000 (  0.0%)   0.0046 (  0.1%)   0.0048 (  0.1%)  CallGraph Construction
   0.0043 (  0.1%)   0.0000 (  0.0%)   0.0043 (  0.1%)   0.0047 (  0.1%)  Slot index numbering
   0.0025 (  0.0%)   0.0000 (  0.0%)   0.0025 (  0.0%)   0.0046 (  0.1%)  Dominator Tree Construction
   0.0050 (  0.1%)   0.0000 (  0.0%)   0.0050 (  0.1%)   0.0045 (  0.1%)  Dead Argument Elimination
   0.0049 (  0.1%)   0.0000 (  0.0%)   0.0049 (  0.1%)   0.0042 (  0.1%)  Remove dead machine instructions
   0.0020 (  0.0%)   0.0000 (  0.0%)   0.0020 (  0.0%)   0.0041 (  0.0%)  Natural Loop Information
   0.0043 (  0.1%)   0.0000 (  0.0%)   0.0043 (  0.1%)   0.0040 (  0.0%)  Dominator Tree Construction
   0.0042 (  0.1%)   0.0000 (  0.0%)   0.0042 (  0.0%)   0.0040 (  0.0%)  Loop-Closed SSA Form Pass
   0.0039 (  0.0%)   0.0000 (  0.0%)   0.0039 (  0.0%)   0.0038 (  0.0%)  Branch Probability Analysis
   0.0030 (  0.0%)   0.0000 (  0.0%)   0.0030 (  0.0%)   0.0038 (  0.0%)  Dominator Tree Construction
   0.0037 (  0.0%)   0.0000 (  0.0%)   0.0037 (  0.0%)   0.0038 (  0.0%)  Dominator Tree Construction
   0.0024 (  0.0%)   0.0000 (  0.0%)   0.0025 (  0.0%)   0.0037 (  0.0%)  Dominator Tree Construction
   0.0029 (  0.0%)   0.0000 (  0.0%)   0.0029 (  0.0%)   0.0037 (  0.0%)  Lazy Value Information Analysis
   0.0037 (  0.0%)   0.0000 (  0.0%)   0.0037 (  0.0%)   0.0036 (  0.0%)  Branch Probability Analysis
   0.0034 (  0.0%)   0.0000 (  0.0%)   0.0034 (  0.0%)   0.0035 (  0.0%)  Branch Probability Basic Block Placement
   0.0016 (  0.0%)   0.0000 (  0.0%)   0.0016 (  0.0%)   0.0034 (  0.0%)  Dominator Tree Construction
   0.0030 (  0.0%)   0.0000 (  0.0%)   0.0030 (  0.0%)   0.0033 (  0.0%)  Dominator Tree Construction
   0.0031 (  0.0%)   0.0000 (  0.0%)   0.0031 (  0.0%)   0.0033 (  0.0%)  Constant Hoisting
   0.0034 (  0.0%)   0.0000 (  0.0%)   0.0034 (  0.0%)   0.0032 (  0.0%)  Dominator Tree Construction
   0.0021 (  0.0%)   0.0035 (  6.9%)   0.0055 (  0.1%)   0.0032 (  0.0%)  Dominator Tree Construction
   0.0038 (  0.0%)   0.0000 (  0.0%)   0.0038 (  0.0%)   0.0032 (  0.0%)  Loop-Closed SSA Form Pass
   0.0034 (  0.0%)   0.0000 (  0.0%)   0.0034 (  0.0%)   0.0029 (  0.0%)  Loop-Closed SSA Form Pass
   0.0016 (  0.0%)   0.0000 (  0.0%)   0.0016 (  0.0%)   0.0029 (  0.0%)  Natural Loop Information
   0.0029 (  0.0%)   0.0000 (  0.0%)   0.0029 (  0.0%)   0.0028 (  0.0%)  Loop-Closed SSA Form Pass
   0.0026 (  0.0%)   0.0000 (  0.0%)   0.0026 (  0.0%)   0.0027 (  0.0%)  Partially inline calls to library functions
   0.0020 (  0.0%)   0.0000 (  0.0%)   0.0020 (  0.0%)   0.0026 (  0.0%)  Dominator Tree Construction
   0.0017 (  0.0%)   0.0000 (  0.0%)   0.0017 (  0.0%)   0.0026 (  0.0%)  Machine Function Analysis
   0.0020 (  0.0%)   0.0000 (  0.0%)   0.0020 (  0.0%)   0.0025 (  0.0%)  NVPTX specific alloca hoisting
   0.0023 (  0.0%)   0.0000 (  0.0%)   0.0023 (  0.0%)   0.0024 (  0.0%)  Dominator Tree Construction
   0.0030 (  0.0%)   0.0000 (  0.0%)   0.0030 (  0.0%)   0.0023 (  0.0%)  Dominator Tree Construction
   0.0026 (  0.0%)   0.0000 (  0.0%)   0.0026 (  0.0%)   0.0022 (  0.0%)  Post-RA pseudo instruction expansion pass
   0.0012 (  0.0%)   0.0000 (  0.0%)   0.0012 (  0.0%)   0.0022 (  0.0%)  Canonicalize natural loops
   0.0020 (  0.0%)   0.0000 (  0.0%)   0.0020 (  0.0%)   0.0022 (  0.0%)  Dominator Tree Construction
   0.0012 (  0.0%)   0.0000 (  0.0%)   0.0012 (  0.0%)   0.0022 (  0.0%)  Dominator Tree Construction
   0.0014 (  0.0%)   0.0000 (  0.0%)   0.0014 (  0.0%)   0.0021 (  0.0%)  Dominator Tree Construction
   0.0026 (  0.0%)   0.0000 (  0.0%)   0.0026 (  0.0%)   0.0021 (  0.0%)  Canonicalize natural loops
   0.0011 (  0.0%)   0.0000 (  0.0%)   0.0011 (  0.0%)   0.0021 (  0.0%)  Delete dead loops
   0.0019 (  0.0%)   0.0000 (  0.0%)   0.0019 (  0.0%)   0.0020 (  0.0%)  MachineDominator Tree Construction
   0.0017 (  0.0%)   0.0000 (  0.0%)   0.0017 (  0.0%)   0.0020 (  0.0%)  MachineDominator Tree Construction
   0.0023 (  0.0%)   0.0000 (  0.0%)   0.0023 (  0.0%)   0.0020 (  0.0%)  MachineDominator Tree Construction
   0.0016 (  0.0%)   0.0000 (  0.0%)   0.0016 (  0.0%)   0.0020 (  0.0%)  Dominator Tree Construction
   0.0012 (  0.0%)   0.0000 (  0.0%)   0.0012 (  0.0%)   0.0020 (  0.0%)  Dominator Tree Construction
   0.0003 (  0.0%)   0.0000 (  0.1%)   0.0003 (  0.0%)   0.0020 (  0.0%)  Lower 'expect' Intrinsics
   0.0016 (  0.0%)   0.0000 (  0.0%)   0.0016 (  0.0%)   0.0019 (  0.0%)  Block Frequency Analysis
   0.0017 (  0.0%)   0.0000 (  0.0%)   0.0017 (  0.0%)   0.0019 (  0.0%)  MachinePostDominator Tree Construction
   0.0014 (  0.0%)   0.0000 (  0.0%)   0.0014 (  0.0%)   0.0017 (  0.0%)  MachineDominator Tree Construction
   0.0011 (  0.0%)   0.0000 (  0.0%)   0.0011 (  0.0%)   0.0017 (  0.0%)  Natural Loop Information
   0.0015 (  0.0%)   0.0000 (  0.0%)   0.0015 (  0.0%)   0.0017 (  0.0%)  Machine Block Frequency Analysis
   0.0014 (  0.0%)   0.0000 (  0.0%)   0.0014 (  0.0%)   0.0017 (  0.0%)  Machine Block Frequency Analysis
   0.0011 (  0.0%)   0.0039 (  7.7%)   0.0050 (  0.1%)   0.0017 (  0.0%)  Scalar Evolution Analysis
   0.0009 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.0%)   0.0016 (  0.0%)  Natural Loop Information
   0.0017 (  0.0%)   0.0000 (  0.0%)   0.0017 (  0.0%)   0.0015 (  0.0%)  Machine Block Frequency Analysis
   0.0015 (  0.0%)   0.0000 (  0.0%)   0.0015 (  0.0%)   0.0015 (  0.0%)  Natural Loop Information
   0.0009 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.0%)   0.0014 (  0.0%)  Natural Loop Information
   0.0015 (  0.0%)   0.0000 (  0.0%)   0.0015 (  0.0%)   0.0014 (  0.0%)  Natural Loop Information
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0007 (  0.0%)   0.0014 (  0.0%)  Scalar Evolution Analysis
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0014 (  0.0%)  Scalar Evolution Analysis
   0.0017 (  0.0%)   0.0000 (  0.0%)   0.0017 (  0.0%)   0.0014 (  0.0%)  Unnamed pass: implement Pass::getPassName()
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0013 (  0.0%)  Canonicalize natural loops
   0.0011 (  0.0%)   0.0000 (  0.0%)   0.0011 (  0.0%)   0.0012 (  0.0%)  Global Variable Optimizer
   0.0011 (  0.0%)   0.0000 (  0.0%)   0.0011 (  0.0%)   0.0012 (  0.0%)  Canonicalize natural loops
   0.0012 (  0.0%)   0.0000 (  0.0%)   0.0012 (  0.0%)   0.0012 (  0.0%)  Merge disjoint stack slots
   0.0011 (  0.0%)   0.0000 (  0.0%)   0.0011 (  0.0%)   0.0012 (  0.0%)  Machine Natural Loop Construction
   0.0010 (  0.0%)   0.0000 (  0.0%)   0.0010 (  0.0%)   0.0011 (  0.0%)  Machine Natural Loop Construction
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0007 (  0.0%)   0.0011 (  0.0%)  Speculatively execute instructions
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0007 (  0.0%)   0.0011 (  0.0%)  Process Implicit Definitions
   0.0008 (  0.0%)   0.0000 (  0.0%)   0.0008 (  0.0%)   0.0011 (  0.0%)  Expand ISel Pseudo-instructions
   0.0008 (  0.0%)   0.0000 (  0.0%)   0.0008 (  0.0%)   0.0011 (  0.0%)  Machine Natural Loop Construction
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0007 (  0.0%)   0.0011 (  0.0%)  NVPTX optimize redundant cvta.to.local instruction
   0.0006 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)   0.0011 (  0.0%)  Canonicalize natural loops
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0010 (  0.0%)  Scalar Evolution Analysis
   0.0005 (  0.0%)   0.0000 (  0.0%)   0.0005 (  0.0%)   0.0008 (  0.0%)  MergedLoadStoreMotion
   0.0002 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0008 (  0.0%)  Lower pointer arguments of CUDA kernels
   0.0008 (  0.0%)   0.0000 (  0.0%)   0.0008 (  0.0%)   0.0008 (  0.0%)  Remove unreachable blocks from the CFG
   0.0006 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)   0.0007 (  0.0%)  Replace occurrences of __nvvm_reflect() calls with 0/1
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0007 (  0.0%)  Memory Dependence Analysis
   0.0006 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)   0.0007 (  0.0%)  Remove unreachable machine basic blocks
   0.0009 (  0.0%)   0.0000 (  0.0%)   0.0009 (  0.0%)   0.0006 (  0.0%)  Optimize machine instruction PHIs
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)  Canonicalize natural loops
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0005 (  0.0%)  Memory Dependence Analysis
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0005 (  0.0%)  Inline Cost Analysis
   0.0002 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0005 (  0.0%)  Speculatively execute instructions
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0005 (  0.0%)  Remove unreachable blocks from the CFG
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0005 (  0.0%)  Canonicalize natural loops
   0.0002 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0005 (  0.0%)  Rotate Loops
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0005 (  0.0%)  Memory Dependence Analysis
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0004 (  0.0%)  Lower invoke and unwind, for unwindless code generators
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0004 (  0.0%)  Tail Duplication
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0004 (  0.0%)  Memory Dependence Analysis
   0.0002 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0003 (  0.0%)  SROA
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0003 (  0.0%)  Memory Dependence Analysis
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0002 (  0.0%)  Internalize Global Symbols
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0002 (  0.0%)  Scalar Evolution Analysis
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Insert stack protectors
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  SLP Vectorizer
   0.0002 (  0.0%)   0.0000 (  0.0%)   0.0002 (  0.0%)   0.0001 (  0.0%)  Loop Vectorization
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Post RA top-down list latency scheduler
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Strip Unused Function Prototypes
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Loop Access Analysis
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  StackMap Liveness Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Scalar Evolution Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Safe Stack instrumentation pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Scalar Evolution Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Analyze Machine Code For Garbage Collection
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Machine Instruction Scheduler
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Lower Garbage Collection Instructions
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0001 (  0.0%)  Scalar Evolution Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Alignment from assumptions
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Live Stack Slot Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Shadow Stack GC Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)  Stack Slot Coloring
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Local Stack Slot Allocation
   0.0001 (  0.0%)   0.0000 (  0.0%)   0.0001 (  0.0%)   0.0000 (  0.0%)  Assign valid PTX names to globals
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Merge Duplicate Global Constants
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Basic Alias Analysis (stateless AA impl)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Ensure that the global variables are in the global address space
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Unnamed pass: implement Pass::getPassName()
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  No Alias Analysis (always returns 'may' alias)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Basic Alias Analysis (stateless AA impl)
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  A No-Op Barrier Pass
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   8.3034 (100.0%)   0.0502 (100.0%)   8.3537 (100.0%)   8.3467 (100.0%)  Total
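
The percentages quoted for the three GVN runs can be checked against the wall-clock column of the report above. The following is a minimal sketch of that arithmetic; `percentOfTotal`, `gvnShare`, and the constant names are hypothetical, and the numbers are copied from the "Global Value Numbering" rows and the total line of the report.

```cpp
#include <cassert>

// Share of the total wall-clock time, in percent.
double percentOfTotal(double passSeconds, double totalSeconds) {
  assert(totalSeconds > 0.0);
  return 100.0 * passSeconds / totalSeconds;
}

// Wall-clock seconds copied from the timing report.
constexpr double kTotalWall = 8.3467;
constexpr double kGVNWall[] = {0.4002, 0.2223, 0.2161};

// Combined share of all three GVN runs, in percent.
double gvnShare() {
  double sum = 0.0;
  for (double t : kGVNWall)
    sum += t;
  return percentOfTotal(sum, kTotalWall);
}
```

With these inputs, `percentOfTotal(0.2161, kTotalWall)` comes out near 2.6, which matches the figure quoted for the extra run, and `gvnShare()` comes out at roughly 10.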

run GVN only under -O3

jingyue updated this object. Jul 22 2015, 9:59 PM

Closed by commit rL242982: [NVPTX] run LSR before straight-line optimizations (authored by jingyue). Jul 22 2015, 9:59 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/trunk/lib/Target/NVPTX/NVPTXTargetMachine.cpp
Size: 37 lines

Diff 30446

llvm/trunk/lib/Target/NVPTX/NVPTXTargetMachine.cpp

[135 lines not shown]
 public:
   void addIRPasses() override;
   bool addInstSelector() override;
   void addPostRegAlloc() override;
   void addMachineSSAOptimization() override;

   FunctionPass *createTargetRegisterAllocator(bool) override;
   void addFastRegAlloc(FunctionPass *RegAllocPass) override;
   void addOptimizedRegAlloc(FunctionPass *RegAllocPass) override;

+private:
+  // if the opt level is aggressive, add GVN; otherwise, add EarlyCSE.
+  void addEarlyCSEOrGVNPass();
 };
 } // end anonymous namespace

 TargetPassConfig *NVPTXTargetMachine::createPassConfig(PassManagerBase &PM) {
   NVPTXPassConfig *PassConfig = new NVPTXPassConfig(this, PM);
   return PassConfig;
 }

 TargetIRAnalysis NVPTXTargetMachine::getTargetIRAnalysis() {
   return TargetIRAnalysis([this](Function &F) {
     return TargetTransformInfo(NVPTXTTIImpl(this, F));
   });
 }

+void NVPTXPassConfig::addEarlyCSEOrGVNPass() {
+  if (getOptLevel() == CodeGenOpt::Aggressive)
+    addPass(createGVNPass());
+  else
+    addPass(createEarlyCSEPass());
+}

 void NVPTXPassConfig::addIRPasses() {
   // The following passes are known to not play well with virtual regs hanging
   // around after register allocation (which in our case, is all registers).
   // We explicitly disable them here. We do, however, need some functionality
   // of the PrologEpilogCodeInserter pass, so we emulate that behavior in the
   // NVPTXPrologEpilog pass (see NVPTXPrologEpilogPass.cpp).
   disablePass(&PrologEpilogCodeInserterID);
   disablePass(&MachineCopyPropagationID);
   disablePass(&TailDuplicateID);

   addPass(createNVPTXImageOptimizerPass());
-  TargetPassConfig::addIRPasses();
   addPass(createNVPTXAssignValidGlobalNamesPass());
   addPass(createGenericToNVVMPass());

+  // === Propagate special address spaces ===
   addPass(createNVPTXLowerKernelArgsPass(&getNVPTXTargetMachine()));
   // NVPTXLowerKernelArgs emits alloca for byval parameters which can often
   // be eliminated by SROA.
   addPass(createSROAPass());
   addPass(createNVPTXLowerAllocaPass());
   addPass(createNVPTXFavorNonGenericAddrSpacesPass());
   // FavorNonGenericAddrSpaces shortcuts unnecessary addrspacecasts, and leave
   // them unused. We could remove dead code in an ad-hoc manner, but that
   // requires manual work and might be error-prone.
   addPass(createDeadCodeEliminationPass());

+  // === Straight-line scalar optimizations ===
   addPass(createSeparateConstOffsetFromGEPPass());
   addPass(createSpeculativeExecutionPass());
   // ReassociateGEPs exposes more opportunites for SLSR. See
   // the example in reassociate-geps-and-slsr.ll.
   addPass(createStraightLineStrengthReducePass());
   // SeparateConstOffsetFromGEP and SLSR creates common expressions which GVN or
   // EarlyCSE can reuse. GVN generates significantly better code than EarlyCSE
   // for some of our benchmarks.
-  if (getOptLevel() == CodeGenOpt::Aggressive)
-    addPass(createGVNPass());
-  else
-    addPass(createEarlyCSEPass());
+  addEarlyCSEOrGVNPass();
   // Run NaryReassociate after EarlyCSE/GVN to be more effective.
   addPass(createNaryReassociatePass());
   // NaryReassociate on GEPs creates redundant common expressions, so run
   // EarlyCSE after it.
   addPass(createEarlyCSEPass());

+  // === LSR and other generic IR passes ===
+  TargetPassConfig::addIRPasses();
+  // EarlyCSE is not always strong enough to clean up what LSR produces. For
+  // example, GVN can combine
+  //
+  // %0 = add %a, %b
+  // %1 = add %b, %a
+  //
+  // and
+  //
+  // %0 = shl nsw %a, 2
+  // %1 = shl %a, 2
+  //
+  // but EarlyCSE can do neither of them.
+  addEarlyCSEOrGVNPass();
 }

 bool NVPTXPassConfig::addInstSelector() {
   const NVPTXSubtarget &ST = *getTM<NVPTXTargetMachine>().getSubtargetImpl();

   addPass(createLowerAggrCopies());
   addPass(createAllocaHoisting());
   addPass(createNVPTXISelDag(getNVPTXTargetMachine(), getOptLevel()));
[88 lines not shown]
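
The comment added in addIRPasses() gives two pairs of instructions that GVN can combine: a commuted add, and a shl that differs only in its nsw flag. The toy below illustrates why such pairs can be recognized as equal under one keying scheme and missed under another. It is a sketch under stated assumptions and not how GVN or EarlyCSE are implemented: `Expr`, `isCommutative`, `exactKey`, and `canonicalKey` are hypothetical names, and `exactKey` is a deliberately naive stand-in for a purely syntactic comparison.

```cpp
#include <string>
#include <utility>

// Toy binary expression; not an LLVM type.
struct Expr {
  std::string opcode; // e.g. "add", "shl"
  std::string lhs;
  std::string rhs;
  bool nsw;           // models a flag such as the one in "shl nsw"
};

bool isCommutative(const std::string &opcode) {
  return opcode == "add" || opcode == "mul" || opcode == "and" ||
         opcode == "or" || opcode == "xor";
}

// Naive key: two expressions match only if they are written identically,
// including operand order and flags.
std::string exactKey(const Expr &e) {
  return e.opcode + (e.nsw ? " nsw " : " ") + e.lhs + ", " + e.rhs;
}

// Canonical key: sort the operands of commutative opcodes and ignore the
// flag, so that both pairs from the comment map to the same key. A real
// optimizer merging two such instructions would also have to keep only the
// flags that both of them carry.
std::string canonicalKey(const Expr &e) {
  std::string a = e.lhs;
  std::string b = e.rhs;
  if (isCommutative(e.opcode) && b < a)
    std::swap(a, b);
  return e.opcode + " " + a + ", " + b;
}
```

For `Expr{"add", "%a", "%b", false}` and `Expr{"add", "%b", "%a", false}`, the exact keys differ while the canonical keys agree; the same holds for `Expr{"shl", "%a", "2", true}` and `Expr{"shl", "%a", "2", false}`. Because shl is not commutative, swapping its operands still produces a different canonical key.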