This patch fixes PR25108 and "fails" a couple of testcases (see below), but is yet to help a real workload to justify committing. Posted in hope it may become useful.
The patch identifies scalar instructions that lie on data dependence cycles and boosts the cost of SLP tree entries that contain such scalar instructions, thereby preventing it from being vectorized. A more precise solution would check how much of the latency across such dependence cycles is hidden by ILP, considering the VF.
The patch contains 3 parts, which may be of independent interest:
- Resurrecting DataFlow.h that was deleted by Chandler two years ago for lack of use (r202825). It is employed here to provide a use-def dependence graph over Values. In doing so we’re ignoring memory dependences, which do not appear in this PR, but may require attention in the future. The appropriate location is IR/DataFlow.h rather than under Support; this patch simply resurrects the file, to be moved in a separate commit if desired.
- Enhancing scc_iterator in order to iteratively compute strongly-connected components in the data-dependence graph, starting from the scalar instructions at the root of the SLP tree (e.g., a vector of stores). The current scc_iterator works from a single entry node; if multiple entry nodes are to be scanned, an artificial entry node can be used as done in FunctionAttrs.cpp/SyntheticRoot. In our case we want to repeatedly start from a subset of entry nodes, record the nodes found to lie on cycles, and refrain from scanning parts of the graph multiple times. This is accomplished by providing scc_iterator with an optional argument holding nodes that have been visited already and should not be revisited. This extension, which follows Tarjan’s original algorithm, can be used to simplify FunctionAttrs.cpp; doing so deserves a separate patch.
- The SLP vectorizer adjusts the cost of each to-be-vector-instruction that contains a cyclic scalar instruction. We try to save compile time by computing SCC’s only for SLP trees that are about to be vectorized based on their original total cost, and recalculate this cost on demand.
The patch causes 3 tests to fail as they no longer get vectorized: Transforms/SLPVectorizer/X86/{cycle_dup.ll, external_user.ll, phi.ll}. Testing the first reveals that it’s indeed 15% faster with -fno-slp-vectorize. To circumvent the cost model from interfering with these tests, they should include -force-vector-width=4, but that only applies to the loop vectorizer. Forcing the slp vectorizer may deserve separate attention.