This is an archive of the discontinued LLVM Phabricator instance.

[tablegen] Improve performance of -gen-register-info by replacing barely-necessary std::map with a sorted vector
ClosedPublic

Authored by dsanders on Aug 3 2018, 2:33 PM.

Details

Summary

This particular map is hardly ever queried and has a phased usage pattern (insert,
iterate, query, insert, iterate) so it's a good candidate for a sorted vector and
std::lower_bound.

This significantly reduces the run time of runTargetDesc() in some circumstances.
One llvm-tblgen invocation in my build improves the time spent in runTargetDesc()
from 9.86s down to 0.80s (~92%) without changing the output. The same invocation
also has 2GB less allocation churn.

Diff Detail

Event Timeline

dsanders created this revision.Aug 3 2018, 2:33 PM
rtereshin requested changes to this revision.Aug 3 2018, 3:51 PM

This is an impressive result, thanks for doing this!

include/llvm/ADT/STLExtras.h
754

Maybe it makes sense to do this not_fn-style (https://en.cppreference.com/w/cpp/utility/functional/not_fn) so it could work with function pointers and lambda-functions, not just struct-like function objects with trivial constructors.

Of course it makes sense to do that only when actually needed, not necessarily now.

utils/TableGen/RegisterInfoEmitter.cpp
356

I feel like

DwarfRegNums.emplace_back(Reg, std::move(RegNums));  // I like to move it, move it

is going to make sense here: https://godbolt.org/g/rNN15d

361

I think instead of doing this std::reverse we could just do std::unique below on rbegin/rend instead of begin/end and achieve the same result with less memory churn.

376

Looks like the original std::map has uniqued the records by !LessRecordRegister()(Reg, LastSeenReg) && !LessRecordRegister()(LastSeenReg, Reg), not by a pointer (see https://en.cppreference.com/w/cpp/container/map).

Maybe we could add another function object to ADT/STLExtras.h, something like

template<typename LessTy>
struct equivalent {
  LessTy less;

  template <typename A, typename B> bool operator()(A &&a, B &&b) const {
    return !less(std::forward<A>(a), std::forward<B>(b)) &&
           !less(std::forward<B>(b), std::forward<A>(a);
  }
};

(give or take) so we could call std::unique(DwarfRegNums.begin(), DwarfRegNums.end(), on_first<equivalent<LessRecordRegister>>());

More efficient way of achieving the same in this specific case I think is bringing in C++17's not_fn (https://en.cppreference.com/w/cpp/utility/functional/not_fn) and doing std::unique(DwarfRegNums.begin(), DwarfRegNums.end(), not_fn(on_first<LessRecordRegister>()));: if a is not less than b than a is equivalent b as the vector is sorted and b not less than a is already guaranteed.

453

I think this immediately breaks the "sorted" invariant required for lower_bound to work correctly. I think we need to remember the size of the DwarfRegNums before the loop and do std::lower_bound(DwarfRegNums.begin(), DwarfRegNums.begin() + DwarfRegNumsSize, ...) instead of std::lower_bound(DwarfRegNums.begin(), DwarfRegNums.end(), ... to fix this. The std::sort after the loop will be of course still required.

This revision now requires changes to proceed.Aug 3 2018, 3:51 PM
dsanders added inline comments.Aug 3 2018, 6:24 PM
include/llvm/ADT/STLExtras.h
754

I've just had a look at the implementation of not_fn, and I think it's overkill for what we need it for. This implementation is consistent with the less_first and deref functions defined earlier in this file.

utils/TableGen/RegisterInfoEmitter.cpp
356

Good point. I'll make this change shortly

361

I agree. I was also wondering if the reverse is necessary to begin with, I've only put it there to preserve behaviour. None of the in-tree targets are emitting this warning and it seems fairly arbitrary to pick the last definition over the first. I'm pretty sure we can delete it

Unfortunately, while unique() is happy with a reverse_iterator, you can't give the result to erase()

376

Looking at this code again, there's a few things that are questionable about it (and a couple we haven't already discussed off-list).

LessRecordRegister is sorting by Record::getName(). The order is a bit weird but that's not particularly important here*, the important bit is that it's an order over Record::getName(). Record::getName() is unique for each Record since tablegen requires unique names and will synthesize unique names them for anonymous records. That would make it impossible to have duplicates were it not for RegisterTuple<>. RegisterTuple<> synthesizes new Record objects and derives their name from the subregisters. As a result, duplicate (add R1, R2) sequences in a RegisterTuple will break the uniqueness of the Record names.

However, the comments for RegisterTuple says "Most fields from the Register class are inferred, and the AsmName and Dwarf numbers are cleared." so RegisterTuple<> to RegisterTuple<> conflicts don't matter since the DwarfRegNum vector will be the same anyway. The only case I can see being a problem is when a Record synthesized from a RegisterTuple<> conflicts with a user-provided Record.


* At first glance, it looks like it's a a normal comparison except that it sorts numbers numerically, but it actually compares all the alpha components first, then all the numeric components. It also compares the length on the numeric strings before the value. As a result "A2A" < "A2B" and "A2A" < "A01A"

453

Well spotted. Looking at this again, there's another thing I missed on the first pass. This is handling the DwarfAlias field which can only be set for user-defined registers (and not the synthesized ones). We already added all the registers to the map on line 356 so we don't need to emplace. We should replace the existing pair instead

dsanders updated this revision to Diff 159155.Aug 3 2018, 7:11 PM

Use std::move
Fix the comparison by pointer (but still avoid redundant use of LessRecordRegister since it's expensive compared to what we need)
Fix the lower_bound issue

I'm going to follow up with a couple tidy-ups since the extra code has got a bit large

rtereshin accepted this revision.Aug 3 2018, 9:13 PM

LGTM with minor nitpicks, please feel free to make changes you'll find reasonable while committing.

utils/TableGen/RegisterInfoEmitter.cpp
361

So do you want to delete it?

Unfortunately, while unique() is happy with a reverse_iterator, you can't give the result to erase()

Sadly true, didn't expect that.

454

I think we can get rid of this second lower_bound call if we iterate over DwarfRegNums, not over Regs.

460

Why not just RegIter->second = AliasIter->second;?

This revision is now accepted and ready to land.Aug 3 2018, 9:13 PM
rtereshin added inline comments.Aug 3 2018, 9:26 PM
utils/TableGen/RegisterInfoEmitter.cpp
455

Ah, missed an easy one: this sort is not needed anymore as we don't insert anything into DwarfRegNums anymore in the loop above AFAICT.

dsanders updated this revision to Diff 159591.Aug 7 2018, 1:53 PM
dsanders marked 7 inline comments as done.

Fixed most of the nits. Left out the one about iterating over DwarfRegNums since it would make the patch noisy.

Will re-generate the numbers before committing

dsanders edited the summary of this revision. (Show Details)Aug 7 2018, 4:49 PM
This revision was automatically updated to reflect the committed changes.
RKSimon added a subscriber: RKSimon.Aug 8 2018, 2:19 AM

@dsanders I'm curious how you noticed this particular case - were you just going through perf traces, noticed the hit and manually worked out how the map was being poorly used?

My guess is that there might be more cases like this where the choice of data structure wasn't optimal and I'm starting to wonder if there are ways we could instrument them all to identify usage patterns etc.

@dsanders I'm curious how you noticed this particular case - were you just going through perf traces, noticed the hit and manually worked out how the map was being poorly used?

Pretty much. Our build bottlenecks on some of the tablegen passes and every so often I take a look at them in Instruments and see if I can spot anything in the trace that ought to be looked into. The Time Profiler is useful for this but it's not very good at pinpointing a problem because it's a statistical profiler (High Frequency mode makes a big difference though). I found this one through the Allocation Density graph on the Allocations track. There was a long spike in allocations that wasn't coupled with an increase in memory usage. I selected that region in the timeline and looked at the detailed graphs and found that the allocations were mostly the same size. I then looked at the allocations for that size and found they all came from inserting into the map.

In the traces I'm looking at normalizeWeight() in CodeGenRegisters.cpp is by far the dominating factor in terms of time but there doesn't seem to be any simple improvements. I need to look into the details of what it's doing and how but looking at it from a high level, the main observation I have is that there's generally only a few register/sub-register topologies even if there are lots of registers. If we can calculate the weights for those topologies and then apply them to the relevant registers then that should be a significant saving.

My guess is that there might be more cases like this where the choice of data structure wasn't optimal and I'm starting to wonder if there are ways we could instrument them all to identify usage patterns etc.

That's an interesting idea. We may be able to make a custom instrument that can reveal phased usage (https://developer.apple.com/videos/play/wwdc2018/410/)