This is an archive of the discontinued LLVM Phabricator instance.

ScheduleDAGInstrs::buildSchedGraph() rewritten.
AbandonedPublic

Authored by jonpa on Feb 24 2015, 2:17 AM.

Details

Reviewers
atrick
hfinkel
Summary

The buildSchedGraph() was in need of reworking, as the AA features had been added on top of earlier code, making it very difficult to understand, and buggy.

What was really awkward before was that a dependency would be checked, and then the potential successor could, instead of becoming a successor, be put into RejectMemNodes and otherwise be forgotten about. It was then very difficult to know how to make sure that later nodes actually did all the right dependency checks. RejectMemNodes was checked basically all the time, which was slow, and not very clever, because in that set all information about the SUs, such as their mapping to Values, was lost.

New design:

Basically, I have removed RejectMemNodes, adjustChainDeps(), iterateChainSucc(), and the AliasChain concept. An unknown store used to become the AliasChain, but now becomes a store mapped to 'unknownValue'. RejectMemNodes and adjustChainDeps() used to be a safety-net for everything, but this is not needed anymore because the lists of SUs mapped from any Value are not cleared.

There are now four maps: Stores, Loads, NonAliasStores and NonAliasLoads. What used to be PendingLoads is instead the list of SUs mapped from 'unknownValue' in Loads, just like unknown stores. In addition to this, there is the BarrierChain node.

To build memory dependencies, each SUnit is analyzed in turn, handling the different cases of 1) a global memory object, 2) an unknown store, 3) an unknown load, and 4-7) Value-mapped (NonAlias)Stores / (NonAlias)Loads. Each SUnit either becomes the BarrierChain or is put into one of the maps. For each SUnit encountered, all the information about previous ones is still available, and the proper dependencies are easily checked.
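The case split above could look roughly like this. This is an illustrative stand-in, not the actual patch: SUnit is reduced to a stub, SchedState and handleMemAccess are invented names, and string keys stand in for the real Value* / 'unknownValue' keys.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

// Simplified sketch of the rewritten buildSchedGraph() bookkeeping.
struct SUnit { int NodeNum; };

using SUList = std::list<SUnit *>;
// One list of SUs per underlying IR Value; the reserved key "unknown"
// plays the role of the patch's 'unknownValue' for unanalyzable accesses.
using Value2SUsMap = std::map<std::string, SUList>;

struct SchedState {
  Value2SUsMap Stores, Loads;                 // may alias
  Value2SUsMap NonAliasStores, NonAliasLoads; // proven not to alias
  SUnit *BarrierChain = nullptr;              // global memory object, etc.
};

// Dispatch for one memory SUnit, mirroring cases 1-7 of the summary.
void handleMemAccess(SchedState &S, SUnit *SU, bool IsStore, bool IsBarrier,
                     bool KnownValue, bool MayAlias, const std::string &V) {
  if (IsBarrier) {          // case 1: becomes the new BarrierChain
    S.BarrierChain = SU;
    return;
  }
  const std::string Key = KnownValue ? V : "unknown"; // cases 2-3
  Value2SUsMap &M = MayAlias ? (IsStore ? S.Stores : S.Loads)
                             : (IsStore ? S.NonAliasStores : S.NonAliasLoads);
  // In the real patch, dependencies against earlier SUs in the relevant
  // lists are added here before the insertion.
  M[Key].push_back(SU);     // cases 4-7
}
```

Since the lists are never cleared, every earlier SU stays reachable through its map entry, which is what makes the old RejectMemNodes safety-net unnecessary.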

I tried exactly this, expecting it to be significantly slower than the previous implementation, since there is no clearing of lists and no iteration limits. I was however surprised to find that it does just as well, if not better, on my out-of-tree target test suite (a VLIW architecture). I am guessing the brute-force approach of RejectMemNodes is the reason for this.

Test suite totals (buildSchedGraph()):

      MIsNeedChainEdge() calls   AA queries   Num Edges in DAG   compile time (clang)
Old   738465                     249456       2829228            1785.37s
New   394762                     183551       2848443            1599.43s

AA queries are those from within MIsNeedChainEdge(). Compile time is for the clang invocation, not just the scheduler pass time, and please note as usual that the timing results are preliminary. The code output was practically identical. Similar results were obtained without AA. The old implementation also seemed to have the worst single compile time.

The code owner's main requirements for a new implementation are:

  • AA-aware dependencies are still off by default
  • default behavior is (obviously) not impacted
  • default compile time does not significantly increase

I think the new implementation meets these requirements, but I do not consider my patch necessarily ready to commit. Rather, I would first like feedback from the various target owners, who should evaluate it, and of course I would appreciate any opinions and suggestions for improvements. The point I feel most unsure about is how everyone wants to deal with compile time and huge regions; see below.

Note that test/CodeGen/PowerPC/vec-abi-align.ll currently fails with this patch.

Regarding search bounds and huge regions:

The scheduler will have to compromise on huge regions, although these should be relatively rare. This used to be handled in iterateChainSucc() with a limit on depth. The general motto so far seems to have been to let the scheduler do its work without time limits on any reasonable code, since depth < 200 is very generous, and yet it loses to this new implementation even though the latter is unbounded.

I have begun to experiment with this, and in my patch I have set a limit on the size of each of the four maps (Stores, Loads, NonAliasStores and NonAliasLoads). This is achieved by reducing the lists contained in a map in order of size. The limit is very high by default (500), but can be controlled with the option -max-build-sched-nodes. When this limit is reached, the map is reduced to 3/4 of its size. I found that I could set it to 50 with nearly no change in the output, and this improved clang total compile time by ~3% compared to the value of 500 (or ~6% compared to the original implementation). Of course, this was probably due to just a few big test cases out of the many.
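A minimal sketch of such a reduction, assuming the policy is to trim the largest lists first, dropping their oldest (FIFO) entries; the patch's exact policy may differ, and reduceMap / totalSize are illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <vector>

using SUList = std::list<int>; // SU NodeNums, oldest first (FIFO)
using Value2SUsMap = std::map<std::string, SUList>;

unsigned totalSize(const Value2SUsMap &M) {
  unsigned N = 0;
  for (const auto &KV : M)
    N += KV.second.size();
  return N;
}

// When the map reaches Limit SUs in total, shrink it back to 3/4 of Limit.
void reduceMap(Value2SUsMap &M, unsigned Limit) {
  unsigned Cur = totalSize(M);
  if (Cur < Limit)
    return;
  const unsigned Target = (Limit * 3) / 4;
  // Visit the lists from largest to smallest.
  std::vector<SUList *> Lists;
  for (auto &KV : M)
    Lists.push_back(&KV.second);
  std::sort(Lists.begin(), Lists.end(),
            [](const SUList *A, const SUList *B) {
              return A->size() > B->size();
            });
  for (SUList *L : Lists) {
    while (Cur > Target && !L->empty()) {
      L->pop_front(); // FIFO: the oldest SUs are dropped first
      --Cur;
    }
    if (Cur <= Target)
      break;
  }
}
```

In the real patch an SU dropped from a list still needs a conservative dependency (e.g. via BarrierChain); that part is omitted here.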

There are different approaches to this, and I list some additional ideas for everyone to consider:

  • Call addPred() instead of addChainDependency() after some limit on the number of calls to addChainDependency() has been reached. This would be most similar to the old version.
  • Set a limit on each list size in a map, instead of on the whole map. A bit simpler.
  • Very simplistic: make next memory access BarrierChain after a certain number of memory accesses have been analyzed. This might suffice if all we want is to guard against compile time explosion in very rare, super huge cases.
  • Add one more level of mapping: Value -> TargetMemAccessAnalysisResult -> SUs. I suspect that some lists grow very big, but they could be split by the target. I for instance had a list of 100+ unknown stores, which were all 'no-alias' due to calls to TII->areMemAccessesTriviallyDisjoint(). Instead of reducing this list when it grows too large, it would be better to split it into a map keyed by something like a TargetMemAccessAnalysisResult {register, offset}.

Diff Detail

Event Timeline

jonpa updated this revision to Diff 20573.Feb 24 2015, 2:17 AM
jonpa retitled this revision from to ScheduleDAGInstrs::buildSchedGraph() rewritten..
jonpa updated this object.
jonpa edited the test plan for this revision. (Show Details)
jonpa added a subscriber: Unknown Object (MLST).
materi added a subscriber: materi.Feb 24 2015, 5:14 AM
hfinkel edited edge metadata.Feb 24 2015, 4:12 PM

Wow! This sounds like really great progress.

include/llvm/CodeGen/ScheduleDAGInstrs.h
78

Indent?

83

Don't add a blank line here.

147

There may be a better option compared to deriving this from std::list; for one thing, std::list tends to have terrible cache locality.

Given that the SUs are numbered, and we're limiting the depth to something in the 100s, would using a SparseSet (from llvm/ADT/SparseSet.h) be better? It has constant-time clear, ordered vector-speed iteration, and constant-time find/insert/erase.
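For reference, the constant-time-clear trick behind SparseSet can be shown with a minimal standalone sparse/dense pair. This is not llvm::SparseSet itself (whose API and template parameters differ), just the underlying idea:

```cpp
#include <cassert>
#include <vector>

// Minimal sparse-set sketch; keys are SU NodeNums in [0, Universe).
class SimpleSparseSet {
  std::vector<unsigned> Sparse; // NodeNum -> index into Dense (may be stale)
  std::vector<unsigned> Dense;  // insertion-ordered members
public:
  explicit SimpleSparseSet(unsigned Universe) : Sparse(Universe, 0) {}
  bool contains(unsigned K) const {
    // Valid only if the back-pointer round-trips through Dense.
    return Sparse[K] < Dense.size() && Dense[Sparse[K]] == K;
  }
  void insert(unsigned K) {
    if (contains(K))
      return;
    Sparse[K] = Dense.size();
    Dense.push_back(K);
  }
  void clear() { Dense.clear(); } // O(1): Sparse may stay dirty
  unsigned size() const { return Dense.size(); }
};
```

Note that the usual sparse-set erase swaps with the last element, which breaks insertion order, a point that matters for the FIFO reductions discussed later in this thread.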

261

Line too long.

266

SUList &sulist -> SUList &SL
(or something like that)

287

unknownValue -> UnknownValue

lib/CodeGen/ScheduleDAGInstrs.cpp
116

Smaller than 0?

143

Line too long.

jonpa updated this revision to Diff 20842.Feb 27 2015, 4:43 AM
jonpa edited the test plan for this revision. (Show Details)
jonpa edited edge metadata.

Minor fixes, as requested.

Regarding SUList and use of std::list:

I have tried a few alternatives to std::list for SUList, but unfortunately I cannot find anything faster:

  • SparseSet: Quite memory expensive, considering that a map of max size MaxS will in the worst case contain MaxS lists of SUs, each of size one but each with a SparseSet::Universe of SUnits.size(). This will certainly also degrade cache performance. Furthermore, this data structure does not guarantee iteration in insertion order after an element has been erased, which is needed for the repeated FIFO-order reductions of the list.
  • CircularSmallVector: I tried to derive a vector from SmallVector that was reducible in FIFO order without needless re-allocation. The class uses Head and Tail indexes into the vector; instead of erasing elements, the indexes are advanced and may wrap around after repeated calls to reduce(). It was however marginally slower than std::list.
  • SmallVector: Just using a vector was of course very slow, due to all the copying during reductions.
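A minimal sketch of such a circular, FIFO-reducible vector. This is a stand-in using std::vector rather than deriving from llvm::SmallVector, and CircularVector with its fixed capacity is an illustrative simplification, not the tried class:

```cpp
#include <cassert>
#include <vector>

// Ring buffer over a plain vector: reduce() advances Head instead of
// erasing, so FIFO trimming needs no element copying.
template <typename T> class CircularVector {
  std::vector<T> Buf;
  unsigned Head = 0, Count = 0;
public:
  explicit CircularVector(unsigned Cap) : Buf(Cap) {}
  bool full() const { return Count == Buf.size(); }
  void push_back(const T &V) {
    assert(!full() && "the real class would grow/reallocate here");
    Buf[(Head + Count) % Buf.size()] = V;
    ++Count;
  }
  // Drop the N oldest elements; Head may wrap around.
  void reduce(unsigned N) {
    assert(N <= Count && "cannot drop more than we hold");
    Head = (Head + N) % Buf.size();
    Count -= N;
  }
  unsigned size() const { return Count; }
  T &front() { return Buf[Head]; }
};
```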

FIFO-order reduction should be desirable: in a region with many unknown stores and loads, it is probably better that any conservative dependencies end up between SUs with very different NodeNums (far apart in the MBB) than between close ones.
It does not matter in which order the list is iterated over, and indexed access is not needed.

If anyone has a good idea on this, or wants to see my circular vector, let me know. Until then, std::list remains.

In D7850#131115, @jonpa wrote:

Minor fixes, as requested.

Regarding SUList and use of std::list:

I have tried a few alternatives to std::list for SUList, but unfortunately I cannot find anything faster:

  • SparseSet: Quite memory expensive, considering that a map of max size MaxS will in the worst case contain MaxS lists of SUs, each of size one but each with a SparseSet::Universe of SUnits.size(). This will certainly also degrade cache performance. Furthermore, this data structure does not guarantee iteration in insertion order after an element has been erased, which is needed for the repeated FIFO-order reductions of the list.
  • CircularSmallVector: I tried to derive a vector from SmallVector that was reducible in FIFO order without needless re-allocation. The class uses Head and Tail indexes into the vector; instead of erasing elements, the indexes are advanced and may wrap around after repeated calls to reduce(). It was however marginally slower than std::list.
  • SmallVector: Just using a vector was of course very slow, due to all the copying during reductions.

FIFO-order reduction should be desirable: in a region with many unknown stores and loads, it is probably better that any conservative dependencies end up between SUs with very different NodeNums (far apart in the MBB) than between close ones.
It does not matter in which order the list is iterated over, and indexed access is not needed.

If anyone has a good idea on this, or wants to see my circular vector, let me know. Until then, std::list remains.

Okay, fair enough. I'd think that what you really want is a linked list stored in approximately-cache-line-sized groups. Unfortunately, we don't have this data structure. What you have certainly seems like a significant improvement (std::list or not), so I'd not insist on a different data structure now.
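The "linked list stored in approximately-cache-line-sized groups" described here is essentially an unrolled linked list. A rough sketch, purely illustrative since no such ADT exists in llvm/ADT as of this review:

```cpp
#include <array>
#include <cassert>
#include <memory>

// Unrolled linked list: each node holds a small array of elements, so
// iteration touches one cache-friendly chunk at a time.
template <typename T, unsigned N = 8> class UnrolledList {
  struct Node {
    std::array<T, N> Elts;
    unsigned Used = 0;
    std::unique_ptr<Node> Next;
  };
  std::unique_ptr<Node> HeadNode;
  Node *TailNode = nullptr;
  unsigned Count = 0;
public:
  void push_back(const T &V) {
    if (!TailNode || TailNode->Used == N) { // start a new chunk
      auto NewNode = std::make_unique<Node>();
      Node *Raw = NewNode.get();
      if (TailNode)
        TailNode->Next = std::move(NewNode);
      else
        HeadNode = std::move(NewNode);
      TailNode = Raw;
    }
    TailNode->Elts[TailNode->Used++] = V;
    ++Count;
  }
  unsigned size() const { return Count; }
  // Visit elements in insertion order.
  template <typename Fn> void forEach(Fn F) const {
    for (const Node *P = HeadNode.get(); P; P = P->Next.get())
      for (unsigned I = 0; I < P->Used; ++I)
        F(P->Elts[I]);
  }
};
```

A FIFO reduce() would additionally need a head index into the first chunk, along the lines of the circular vector above.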

jonpa added a comment.Feb 27 2015, 8:13 AM

Ok, great.

What remains then before I commit this, is that it gets tested on other targets than my own (out-of-tree) VLIW target.

I would be happy to give it a try if no one else has the time, but then I need detailed instructions. The easiest would be if somebody with testing / benchmarking experience with the various targets could try the patch and evaluate compile time and code output and give an okay for the patch.

Don't forget that there is the size limit on the maps to be tested for various values. 50 seemed pretty good for me, at least, and 500 was still faster than the old version :-)

In D7850#131190, @jonpa wrote:

Ok, great.

What remains then before I commit this, is that it gets tested on other targets than my own (out-of-tree) VLIW target.

I apologize for the delay, but I've not had a chance to look at this until today. Unfortunately, the patch no longer cleanly applies. Can you please rebase it?

jonpa abandoned this revision.Dec 3 2015, 12:47 AM

This issue was unfortunately continued months ago on a new review: D8705. Abandoning this one to avoid confusion.