The purpose of the custom linked list was to optimize for the case
of a single-element list. It turns out that TinyPtrVector handles
the same basic scenario even better, reducing the size of
LeaderTableEntry by 33%, and requiring only log2(N) allocations
as the size of the list grows. The only downside is that we have
to store the Value's and BasicBlock's in separate vectors, which
is slightly awkward in a few cases. Fortunately that ends up being
entirely encapsulated inside helper functions.
Diff Detail

- Repository: rG LLVM Github Monorepo
Event Timeline
It looks like this patch causes a crash when running llvm-test-suite: http://llvm-compile-time-tracker.com/show_error.php?commit=26fd1e7f5ce498fa4d9c28a8dd3e7235466fc03a
llvm/include/llvm/Transforms/Scalar/GVN.h, line 294:

This looks incorrect in multiple ways. If we're at the first instruction, this will step before the begin iterator. And the end iterator is invalidated in either case. I think you need something like this:

    VI = entry.Val.erase(VI);
    BI = entry.BB.erase(BI);
    VE = entry.Val.end();
    continue;
llvm/lib/Transforms/Scalar/GVN.cpp, line 2113:

Can this just be return is_contained(entry.BB, BB)?
This still crashes when building llvm-test-suite: http://llvm-compile-time-tracker.com/show_error.php?commit=77bc69cedc32296847573390cf393827b6196d1f There must be something else wrong...
llvm/lib/Transforms/Scalar/GVN.cpp, line 2109:

Hm, I think this is inverted. The old code is a bit hard to understand, but what it's checking is that all BBs are BB, not that any is. So this should be:

    return all_of(entry.BB,
                  [BB](BasicBlock *EntryBB) { return EntryBB == BB; });
llvm/lib/Transforms/Scalar/GVN.cpp, line 2109:

You had that right in your initial version and my is_contained suggestion was wrong :)
It looks like this change ends up being slightly negative in terms of instruction count: http://llvm-compile-time-tracker.com/compare.php?from=bf1b81d076f89bd56e86189b013f27dcf4d73ae8&to=efa882a79c027d27f6deff14e75bd9f558dd95d0&stat=instructions
As far as I can tell, the only behavioral difference with the new version is the order of leaders in the table, which biases the candidates returned by findLeader. The hand-rolled linked list used an unprincipled ordering: the first candidate found was always at the front of the list, but subsequent candidates were in reverse order of discovery. The new version currently stores them strictly in order of discovery. I can try doing a backwards search in findLeader, which would be more similar to the old approach on average. Really, we should probably have a more principled heuristic for choosing amongst the leaders, but that seems out of scope for this change.
Compile-time still looks the same: https://llvm-compile-time-tracker.com/compare.php?from=c2eccc67ce07e9cb374eb0ecdb3038fcb8be08cd&to=1e9114984490b83d4665f12a11f84c83f50ca8f0&stat=instructions That is, mildly negative.
I think that this analysis isn't really correct:
> It turns out that TinyPtrVector handles the same basic scenario even better, reducing the size of LeaderTableEntry by 33%, and requiring only log2(N) allocations as the size of the list grows.
For all practical purposes, the number of allocations in the previous implementation was zero, because a bump pointer allocator was used, which is essentially free. In the new scheme, we get real (global allocator) allocations whenever we have more than one entry.
It is true that for a single entry, the size is reduced by 33%. However, if we go to two elements, then we'll end up separately allocating two SmallVectors with 4 elements each, which means we use a total of 14 pointer-sized values instead of 6 (not counting any additional overhead the global allocator adds). So for 2 entries, we actually use more than twice as much memory. This gets better for 3 and 4 entries, and then again much worse at 5 entries, where we leave behind unused inline SmallVector storage.
I don't know what the actual distribution of the number of entries is, but it's not really clear whether the new implementation uses less memory or is more performant in practice.
I assume that there was some kind of motivation for making this change and you observed better resource utilization for some workload?