buildSchedGraph() was in need of reworking: the AA features had been added on top of earlier code, and the result was buggy and very difficult to understand.
What was really awkward before was that after a dependency check, a potential successor could, instead of becoming a successor, be put into RejectMemNodes and otherwise be forgotten about. It was then very difficult to make sure that later nodes actually performed all the right dependency checks. RejectMemNodes was checked practically all the time, which was slow, and not very clever, since all information about the SUs in that set, such as their mapping to Values, was lost.
New design:
Basically, I have removed RejectMemNodes, adjustChainDeps(), iterateChainSucc(), and the AliasChain concept. An unknown store used to become the AliasChain, but now becomes a store mapped to 'unknownValue'. RejectMemNodes and adjustChainDeps() used to be a safety net for everything, but this is no longer needed, because the lists of SUs mapped from any Value are never cleared.
There are now four maps: Stores, Loads, NonAliasStores and NonAliasLoads. What used to be PendingLoads is now the list of SUs mapped from 'unknownValue' in Loads, just as unknown stores are mapped from 'unknownValue' in Stores. In addition to this, there is the BarrierChain node.
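To make the shape of this concrete, here is a minimal sketch of the bookkeeping. Only the four map names, the 'unknownValue' idea and BarrierChain come from the design above; the key and list types are my illustration, not necessarily what the patch uses:

  // Minimal sketch (illustrative types, not the actual patch):
  #include "llvm/ADT/MapVector.h"
  #include "llvm/ADT/PointerUnion.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/PseudoSourceValue.h"
  #include "llvm/CodeGen/ScheduleDAG.h"
  #include "llvm/IR/Value.h"
  using namespace llvm;

  // Key: the underlying object of a memory access. A sentinel key
  // ('unknownValue') is used for accesses that cannot be analyzed;
  // its exact representation is an implementation detail.
  using ValueType = PointerUnion<const Value *, const PseudoSourceValue *>;
  using SUList = SmallVector<SUnit *, 4>;
  using Value2SUsMap = MapVector<ValueType, SUList>;

  Value2SUsMap Stores, Loads;                  // may-alias accesses
  Value2SUsMap NonAliasStores, NonAliasLoads;  // provably disjoint accesses
  SUnit *BarrierChain = nullptr;               // last "global memory object"

  // What used to be PendingLoads is now simply Loads[unknownValue],
  // and unknown stores likewise live in Stores[unknownValue].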
To build the memory dependencies, each SUnit is analyzed in turn, handling the different cases of 1) global memory object, 2) unknown store, 3) unknown load, and 4-7) Value-mapped (NonAlias)Stores / (NonAlias)Loads. Each SUnit either becomes the BarrierChain or is put into one of the maps. For each SUnit encountered, all the information about the previous ones is still available, so the proper dependencies are easily checked.
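A hedged sketch of that dispatch, continuing the definitions above. The helper names are hypothetical (not the names in the patch), and details such as the visit order and the multiple-underlying-objects case are elided:

  // Case numbers refer to the text above. All helpers are hypothetical.
  for (SUnit *SU : MemSUnitsInVisitOrder) {    // hypothetical container
    MachineInstr &MI = *SU->getInstr();

    // 1) Global memory object (e.g. a call): must be ordered against
    //    everything seen so far, so it becomes the new BarrierChain.
    if (isGlobalMemoryObject(&MI)) {
      becomeBarrierChain(SU);                  // hypothetical helper
      continue;
    }

    // 2)-7): key the access on its underlying object, or on the
    // 'unknownValue' sentinel if it cannot be analyzed.
    ValueType V = getUnderlyingObjectOrUnknown(MI);  // hypothetical
    bool MayAlias = objectMayAlias(MI);              // hypothetical

    Value2SUsMap &StoreMap = MayAlias ? Stores : NonAliasStores;
    Value2SUsMap &LoadMap  = MayAlias ? Loads  : NonAliasLoads;

    if (MI.mayStore()) {
      // A store is checked against previous stores and loads under the
      // same key (an unknownValue store must be checked against every
      // list in the may-alias maps), then recorded under its own key.
      addChainDependencies(SU, StoreMap, V);   // runs MIsNeedChainEdge()
      addChainDependencies(SU, LoadMap, V);
      StoreMap[V].push_back(SU);
    } else {
      // A load only needs ordering against stores.
      addChainDependencies(SU, StoreMap, V);
      LoadMap[V].push_back(SU);
    }
  }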
I tried exactly this, and expected it to be significantly slower than the previous implementation, since there is no clearing of any lists and no iteration limits. I was however surprised to find that it seems to do just as well, if not better, on the test suite for my out-of-tree target (a VLIW architecture). My guess is that the brute-force approach of RejectMemNodes is the reason for this.
Test suite totals:

  buildSchedGraph()   MIsNeedChainEdge() calls   AA queries   Num Edges in DAG   Compile time (clang)
  Old                 738465                     249456       2829228            1785.37s
  New                 394762                     183551       2848443            1599.43s
The AA queries are those made from within MIsNeedChainEdge(). The compile time is for the whole clang invocation, not just the scheduler pass, and please note as usual that the timing results are preliminary. The code output was practically identical. Similar results were obtained without AA. The old implementation also seemed to have the worst single compile time.
The main requirements for the new implementation (as stated by the code owner) are:
- AA-aware dependencies are still off by default
- default behavior is (obviously) not impacted
- default compile time does not significantly increase
I think the new implementation meets these requirements, but I do not consider my patch necessarily ready to commit. Rather, I would first like feedback from the various target owners, who should evaluate it, and of course I would appreciate any opinions and suggestions for improvements. The point I feel most unsure about is how everyone wants to deal with compile time and huge regions; see below.
Note that test/CodeGen/PowerPC/vec-abi-align.ll currently fails with this patch.
Regarding search bounds and huge regions:
The scheduler will have to compromise with huge regions, although these should be relatively rare. This used to be handled in iterateChainSucc() with a limit on depth. The general motto so far seems to have been to let the scheduler do its work without time limits on any reasonable code; a depth limit of 200 is already very generous, and it still loses to this new implementation even though the new one is unbounded.
I have begun to experiment with this, and in my patch I have set a limit on the size of each of the four maps (Stores, Loads, NonAliasStores and NonAliasLoads). This is achieved by reducing the lists contained in a map in order of size. The limit is very high by default (500), but it can be controlled with the option -max-build-sched-nodes. When the limit is reached, the map is reduced to 3/4 of its original size. I found that I could set the limit to 50 while noticing nearly no change in the output, and this improved total clang compile time by ~3% compared to the value of 500 (or ~6% compared to the original implementation). Of course, this was probably due to just a few big test cases out of the many.
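A hedged sketch of this cap, continuing the types above. The option name and default are from this mail; the trimming helper is hypothetical, and the dropped SUs would still have to be kept ordered somehow (e.g. via the BarrierChain), which is left out here:

  #include "llvm/Support/CommandLine.h"
  #include <algorithm>
  #include <vector>

  static cl::opt<unsigned> MaxBuildSchedNodes(
      "max-build-sched-nodes", cl::Hidden, cl::init(500),
      cl::desc("Limit on the number of SUs kept in each memory map"));

  // Hypothetical helper: when a map exceeds the limit, drop its largest
  // lists first until the map is back down to 3/4 of its size.
  static void reduceMap(Value2SUsMap &Map) {
    unsigned NumSUs = 0;
    for (auto &KV : Map)
      NumSUs += KV.second.size();
    if (NumSUs <= MaxBuildSchedNodes)
      return;

    // Visit keys from the largest list to the smallest.
    std::vector<ValueType> Keys;
    for (auto &KV : Map)
      Keys.push_back(KV.first);
    std::sort(Keys.begin(), Keys.end(), [&](ValueType A, ValueType B) {
      return Map[A].size() > Map[B].size();
    });

    unsigned Target = (NumSUs * 3) / 4;
    for (ValueType K : Keys) {
      if (NumSUs <= Target)
        break;
      NumSUs -= Map[K].size();
      Map.erase(K);
    }
  }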
There are different approaches to this, and I list some additional ideas for everyone to consider:
- Call addPred() instead of addChainDependency() after a certain number of calls to addChainDependency(). This would be most similar to the old version.
- Set a limit on each list size in a map, instead of on the whole map. A bit simpler.
- Very simplistic: make the next memory access the BarrierChain after a certain number of memory accesses have been analyzed. This might suffice if all we want is to guard against compile-time explosion in very rare, super-huge cases.
- Add one more level of mapping: Value -> TargetMemAccessAnalysisResult -> SUs. I suspect that some lists grow very big but could be split up by the target. I, for instance, had a list of 100+ unknown stores that were all 'no-alias' due to calls to TII->areMemAccessesTriviallyDisjoint(). Instead of reducing such a list when it grows too large, it would be better to split it into a map keyed on something like a TargetMemAccessAnalysisResult {register, offset}; see the sketch after this list.
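To illustrate that last idea, a hedged sketch of the extra mapping level, continuing the types above. The struct name follows the placeholder in the bullet; its fields and the map types are my assumptions:

  #include <map>
  #include <tuple>

  // Hypothetical target analysis result: a disjointness key such that
  // accesses with different keys are trivially disjoint (cf.
  // TII->areMemAccessesTriviallyDisjoint()).
  struct TargetMemAccessAnalysisResult {
    unsigned BaseReg;  // base register of the access
    int64_t Offset;    // constant offset from that base
    bool operator<(const TargetMemAccessAnalysisResult &O) const {
      return std::tie(BaseReg, Offset) < std::tie(O.BaseReg, O.Offset);
    }
  };

  // Value -> TargetMemAccessAnalysisResult -> SUs. A big 'unknownValue'
  // list of trivially disjoint stores would then split into many small
  // lists that never need to be compared against each other.
  using SplitSUList = std::map<TargetMemAccessAnalysisResult, SUList>;
  using Value2SplitSUsMap = MapVector<ValueType, SplitSUList>;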