This is an archive of the discontinued LLVM Phabricator instance.

[FastISel] Sink more materializations to first use
AbandonedPublic

Authored by probinson on Nov 4 2020, 10:38 AM.

Details

Summary

Local values are constants or addresses that can't be folded into
the instruction that uses them. FastISel materializes these in a
"local value" area that always dominates the current insertion
point, to try to avoid materializing these values more than once
(per block).

https://reviews.llvm.org/D43093 added code to sink these local
value instructions to their first use, which has two beneficial
effects. One, it is likely to avoid some unnecessary spills and
reloads; two, it allows us to attach the debug location of the
user to the local value instruction. The latter effect can
improve the debugging experience for debuggers with a "set next
statement" feature, such as the Visual Studio debugger and PS4
debugger, because instructions to set up constants for a given
statement will be associated with the appropriate source line.

There are also some constants (primarily addresses) that could be
produced by no-op casts or GEP instructions; the main difference
from "local value" instructions is that these are values from
separate IR instructions, and therefore could have multiple users
across multiple basic blocks. D43093 avoided sinking these, even
though they were emitted to the same "local value" area as the
other instructions. The patch comment for D43093 states:

Local values may also be used by no-op casts, which adds the
register to the RegFixups table. Without reversing the RegFixups
map direction, we don't have enough information to sink these
instructions.

This patch implements sinking for these value materialization
instructions, by iterating over the RegFixups map. Usually there
aren't a lot, and building clang in Debug mode using a patched
clang showed a barely-above-the-noise 0.5% increase in build time.

The benefit to reducing spills and restores also exists; I saw a
0.3% reduction in code size for that Debug build of clang.

The original source locations were lost as part of emitting these
instructions to the local value area, but the source location of
the first use seems like a very good second best choice. I saw
another 0.3% reduction in the number of instruction bytes with
line-0 attributions after this patch; the benefit is greater
than that because only some of these instructions had been
attributed to line 0 before the patch.

One interesting effect is that some of these value instructions
previously ended up intermixed with prologue instructions (e.g.
stack homing for parameters); now that they have proper source
locations, they reliably come after the prologue.

Diff Detail

Event Timeline

probinson created this revision.Nov 4 2020, 10:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 4 2020, 10:38 AM
probinson requested review of this revision.Nov 4 2020, 10:38 AM

+ people who reviewed D43093.

rnk added a comment.Nov 5 2020, 1:39 PM

Thanks! How much of the benefit do you think comes from looking across flush points, and how much do you think comes from handling no-op casts?

I think we should also consider scrapping the local value map or maybe flushing it after selecting each instruction. I don't think we're getting a whole lot of value from sharing local values between multiple non-call instructions. Most non-call instructions are pointer, integer, and FP arithmetic. Most integer arithmetic supports folding immediates. Pointer arithmetic may benefit from reusing a value. Consider a series of GEPs (array/field accesses) built off the same global variable loaded from the GOT.

I think we can use the same measurement Paul used to show this change was useful: flush the map after every instruction, rebuild clang with a new compiler, measure the binary size and prevalence of line 0 source locations. If size goes down and line zero locations go down, the local value map is more trouble than it's worth, or, at least, it should be limited to memoizing values per instruction. If we flush the map after every instruction, we can confidently give local values the location of the current instruction, and we don't need to do any complex sinking. We should be able to delete this code.

llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
256

This looks like it's reverting rGa28e767f06d. This code was added to avoid O(n^2) complexity. Do you have an alternative solution to that issue?

I think we should also consider scrapping the local value map or maybe flushing it after selecting each instruction. I don't think we're getting a whole lot of value from sharing local values between multiple non-call instructions.

Toward the end of my work to come up with this patch, I was also thinking the value map is quite likely more trouble than it's worth. It increases register pressure, and once you start spilling the values, that ends up costing more than rematerializing would. Flushing the map on every instruction is an alternative that I thought of fairly late in this process (it has taken a couple of weeks to work out what was going on). I will probably try that experiment next week (I'm off tomorrow).

llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
256

I was not aware of that, thanks for pointing it out. The need to number the entire block comes up when the value instruction is not used until after the next call. That typically happens in a case like foo(&a, bar(&b)); where we compute &a before we start doing the call to bar. The PR37010 example has many thousands of those in a row, and we flush every time we see a call, so my patch will certainly run into this problem.
On the other hand... if we could manage to preserve the original source info for the &a then we wouldn't need to do this sinking hack at all. Sinking the instruction doesn't recover the original source location, it gives us something sorta usable but not really correct. So I'd rather not sink these things at all, actually.

I prototyped a patch that would

  • call flushLocalValueMap() at the top of selectInstruction()
  • turn off the instruction-sinking stuff
  • not erase the DbgLoc applied to local value instructions

I also added statistics for the number of local values added to the map, and the number of lookups that got a hit (assumed to mean a reuse).

statno patchwith patchdelta
local values recorded46861974711277+25080 (+0.5%)
local values reused15245581283825-240733 (-16%)
reuse rate32.5%27.3%-5.2%
.text size127,537,392127,507,888-29505 (-0.02%)
Line-0 bytes9.071,7448,935,299-136445 (-1.5%)

This tells me that comparatively few values were reuses across instruction boundaries (~5%). Despite those reuses going away, overall .text size still went down, which we could assume is due to fewer spills/reloads; I don't have stats there, we'd have to see what regalloc is willing to report.
There's also a pleasant decrease in the number of bytes with line-0 attributions. I noticed in small examples that reloads at the top of a block still get line 0, which seems not unfair.

I don't have build-time stats yet, but as the patch removes the old sinking code, I'd anticipate that build times would go down.

Once I have that data, I'll clean up the prototype patch and post that.

Building Clang with Clang, adding the prototype patch improved build time by a barely perceptible 0.5%.

I'll work on making it presentable, probably post it tomorrow.

rnk added a comment.Nov 10 2020, 1:41 PM

Thanks for gathering the measurements! They seem promising.

I apologize for passing the work of measuring on to you. I think the last time I looked at this, I felt it would be easier (faster?) to code up the "local value sinking" feature than to run measurements and gather consensus to disable local value reuse. But, that was probably not the best tradeoff in the long run. Less code is better than more code.

See D91734 for an alternate approach as suggested by @rnk .

probinson abandoned this revision.Nov 25 2020, 11:40 AM

Abandoned in favor of D91734.