Here are a few improvements proposed for the local cache:
- InitCache always read from per_class_[1] in the fast path. This was not ideal, as we are working with per_class_[class_id]. The latter offers the same property we are looking for (e.g. max_count != 0 means initialized), so we might as well use it and keep our memory accesses local to the same per_class_ element. So change InitCache to take the current PerClass as an argument. This also makes the fast-path assembly of Deallocate a lot more compact;
- Change the 32-bit Refill & Drain functions to mimic their 64-bit counterparts, by passing the current PerClass as an argument. This saves some array computations;
- As far as I can tell, InitCache has no place in Drain: Drain is either called from Deallocate, which already calls InitCache, or from the "upper" Drain, which checks that c->count is strictly greater than 0. So remove it there;
- Move the stats_ updates to after we are done with the per_class_ accesses in an attempt to preserve locality once more;
- Change some CHECKs to DCHECKs: I don't think the ones changed belonged in the fast path; they seemed to be overly cautious failsafes;
- Mark some variables as const.
The overall result is cleaner, more compact generated code for the fast path, and some
performance gains with Scudo (and likely other Sanitizers).
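
For illustration, here is a minimal sketch of the shape these changes give to the fast path. It is not the actual sanitizer_common code: the struct layout, kNumClasses, kMaxCached, the init_calls and stats_freed counters, and the "drain half" policy are all simplified assumptions. It only shows InitCache and Drain taking the current PerClass, Drain not calling InitCache, and the stats update happening after the per_class_ accesses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-ins; not the real allocator types.
constexpr std::size_t kNumClasses = 4;
constexpr std::size_t kMaxCached = 8;

struct PerClass {
  std::size_t count = 0;
  std::size_t max_count = 0;  // 0 means "not initialized yet"
  void *chunks[kMaxCached] = {};
};

struct LocalCache {
  PerClass per_class_[kNumClasses];
  std::size_t init_calls = 0;   // illustration only: counts slow-path inits
  std::size_t stats_freed = 0;  // illustration only: stands in for stats_

  // Slow path: set up every class once.
  void InitCacheSlow() {
    ++init_calls;
    for (std::size_t i = 0; i < kNumClasses; ++i)
      per_class_[i].max_count = kMaxCached;
  }

  // Fast path: only touches the PerClass the caller is already using,
  // instead of always reading a fixed element such as per_class_[1].
  void InitCache(PerClass *c) {
    if (c->max_count != 0)
      return;
    InitCacheSlow();
  }

  // No InitCache here: callers guarantee that c is initialized.
  // Assumed policy for the sketch: keep the older half of the chunks.
  void Drain(PerClass *c) {
    assert(c->max_count != 0);  // plays the role of a DCHECK
    c->count /= 2;
  }

  void Deallocate(std::size_t class_id, void *p) {
    assert(class_id < kNumClasses);  // plays the role of a DCHECK
    PerClass *const c = &per_class_[class_id];
    InitCache(c);
    if (c->count == c->max_count)
      Drain(c);
    c->chunks[c->count++] = p;
    // Stats are updated last, once the per_class_ accesses are done.
    ++stats_freed;
  }
};
```

In this sketch, nine consecutive calls to Deallocate on the same class trigger one slow-path initialization and one Drain, and every other call stays within a single PerClass element.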