This is an archive of the discontinued LLVM Phabricator instance.

[FastISel] Sink local value materializations to first use
ClosedPublic

Authored by rnk on Feb 8 2018, 3:17 PM.

Details

Summary

Local values are constants, global addresses, and stack addresses that
can't be folded into the instruction that uses them. For example, when
storing the address of a global variable into memory, we need to
materialize that address into a register.

FastISel doesn't want to materialize any given local value more than
once, so it generates all local value materialization code at
EmitStartPt, which always dominates the current insertion point. This
allows it to maintain a map of local value registers, and it knows that
the local value area will always dominate the current insertion point.

One downside of the local value area is that for large basic blocks, the
live ranges of the local values can become very long, and we may end up
spilling more than we should.

The other downside is that local value instructions are always emitted
without a source location. This is done to prevent jumpy line tables,
but it means that the local value area will be considered part of the
previous statement. Consider this C code:

call1();      // line 1
++global;     // line 2
++global;     // line 3
call2(&global, &local); // line 4

Today we end up with assembly and line tables like this:

.loc 1 1
callq call1
leaq global(%rip), %rdi
leaq local(%rsp), %rsi
.loc 1 2
addq $1, global(%rip)
.loc 1 3
addq $1, global(%rip)
.loc 1 4
callq call2

The LEA instructions in the local value area have no source location and
are treated as being on line 1. Stepping through the code in a debugger
and correlating it with the assembly won't make much sense, because
these materializations are only required for line 4.
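
With this change, the materializations are sunk down to their first use and
take on its source location, so we would expect output roughly like this
(illustrative only; the exact instruction order depends on the selector):

.loc 1 1
callq call1
.loc 1 2
addq $1, global(%rip)
.loc 1 3
addq $1, global(%rip)
.loc 1 4
leaq global(%rip), %rdi
leaq local(%rsp), %rsi
callq call2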

This is actually problematic for the VS debugger "set next statement"
feature, which effectively assumes that there are no registers live
across statement boundaries. By sinking the local value code into the
statement and fixing up the source location, we can make that feature
work. This was filed as https://bugs.llvm.org/show_bug.cgi?id=35975 and
https://crbug.com/793819.

This change is obviously not enough to make this feature work reliably
in all cases, but I felt that it was worth doing anyway because it
usually generates smaller, more comprehensible -O0 code. If people feel
that this is not worth the complexity or -O0 compile time, my next idea
is to teach clang to emit one basic block per statement. My
understanding is that fast regalloc will not register allocate across
basic blocks, and that will be enough for VS users.

I have not measured the impact on -O0 compile time yet, and would
appreciate any ideas for how to do that.

There are some special cases worth calling out in the commit message:

  1. local values materialized for phis
  2. local values used by no-op casts
  3. dead local value code

Local values can be materialized for phis, and this does not show up as
a vreg use in MachineRegisterInfo. In this case, if there are no other
uses, this patch sinks the value to the first terminator, EH label, or
the end of the BB if nothing else exists.

Local values may also be used by no-op casts, which adds the register to
the RegFixups table. Without reversing the RegFixups map direction, we
don't have enough information to sink these instructions.

Lastly, if the local value register has no other uses, we can delete it.
This comes up when FastISel tries two instruction selection approaches: the
first materializes the value but fails, and the second succeeds without
using the local value.
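
A much-simplified sketch of the per-value insertion-point choice described
above (not the committed code; the helper name and signature are
hypothetical):

// Illustrative sketch only: walk forward from the local value definition and
// stop at its first user. PHI uses are not in MachineRegisterInfo yet, so a
// PHI-only value stops at the first terminator or EH label instead.
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include <iterator>
using namespace llvm;

static MachineBasicBlock::iterator
findSinkPos(MachineBasicBlock &MBB, MachineInstr &LocalValueDef,
            unsigned DefReg, bool UsedByPHI) {
  MachineBasicBlock::iterator I = std::next(LocalValueDef.getIterator());
  for (MachineBasicBlock::iterator E = MBB.end(); I != E; ++I) {
    if (I->readsRegister(DefReg))
      return I;                      // sink directly before the first use
    if (UsedByPHI && (I->isEHLabel() || I->isTerminator()))
      return I;                      // don't sink past labels/terminators
  }
  return MBB.end();                  // no use found: end of the block
}

The real patch is more involved (RegFixups handling, deleting dead values,
use-list based lookup); this only illustrates how the insertion point is
chosen.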

Diff Detail

Repository
rL LLVM

Event Timeline

rnk created this revision.Feb 8 2018, 3:17 PM
aprantl added a reviewer: vsk.Feb 8 2018, 3:22 PM
hans added a comment.Feb 14 2018, 5:39 AM

This looks pretty good to me, but I don't have a lot of experience with fastisel.

One potentially naive question, though: materializing at first use sounds reasonable. In fact, for an instruction selector that's trying to be fast rather than to optimize, I'd expect materialization to happen at basically every use. But here it seems something is inserting all the materializations early so they can be shared by all uses, which they then trivially dominate. Would it be crazy to just materialize "as needed", even if it creates some redundancies? Would that make fastisel both faster and simpler?

I have not measured the impact on -O0 compile time yet, and would appreciate any ideas for how to do that.

Since it's a codegen change, I think it doesn't require any fancy input; a large C program such as the sqlite amalgamation should do well. I tried it, and it shows your version a tiny bit slower, but not by much:

$ perf stat -r50 bin/clang -O0 -c /tmp/sqlite-amalgamation-3220000/sqlite3.c 

 Performance counter stats for 'bin/clang -O0 -c /tmp/sqlite-amalgamation-3220000/sqlite3.c' (50 runs):

       2169.455035      task-clock:u (msec)       #    0.998 CPUs utilized            ( +-  0.35% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            25,921      page-faults:u             #    0.012 M/sec                    ( +-  0.01% )
     6,340,741,096      cycles:u                  #    2.923 GHz                      ( +-  0.31% )
     8,151,193,597      instructions:u            #    1.29  insn per cycle           ( +-  0.02% )
     1,845,160,382      branches:u                #  850.518 M/sec                    ( +-  0.02% )
        42,297,415      branch-misses:u           #    2.29% of all branches          ( +-  0.23% )

       2.173604157 seconds time elapsed                                          ( +-  0.35% )

$ perf stat -r50 bin/clang.rnk -O0 -c /tmp/sqlite-amalgamation-3220000/sqlite3.c 

 Performance counter stats for 'bin/clang.rnk -O0 -c /tmp/sqlite-amalgamation-3220000/sqlite3.c' (50 runs):

       2181.332026      task-clock:u (msec)       #    0.998 CPUs utilized            ( +-  0.41% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            25,912      page-faults:u             #    0.012 M/sec                    ( +-  0.01% )
     6,401,856,357      cycles:u                  #    2.935 GHz                      ( +-  0.39% )
     8,176,731,823      instructions:u            #    1.28  insn per cycle           ( +-  0.03% )
     1,851,112,997      branches:u                #  848.616 M/sec                    ( +-  0.02% )
        42,582,767      branch-misses:u           #    2.30% of all branches          ( +-  0.19% )

       2.185072378 seconds time elapsed                                          ( +-  0.41% )
llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
243 ↗(On Diff #133507)

ultra nit: extra "a" at the end

I discussed alternative approaches with @echristo, and felt that this would be the best one. Eric, do you want to take a look?

rnk updated this revision to Diff 134522.Feb 15 2018, 3:28 PM
  • rebase over prologue/epilogue loc change

Looks good from what I can see, though I'm not super experienced with fast-isel either. Does this work correctly in the case when fast-isel aborts while selecting a block?

rnk added a comment.Feb 26 2018, 5:20 PM

Looks good from what I can see, though I'm not super experienced with fast-isel either. Does this work correctly in the case when fast-isel aborts while selecting a block?

Thanks for reading, though. :) I believe the mid-block abort case came up during testing and is covered by the existing suite. FastISel often aborts on calls: it flushes the local value area, falls back to SelectionDAG to select the single call instruction, and then continues with FastISel. This all seems to work. The flush on abort is what now triggers sinking.

Hi Reid,

The runtime needs to be checked, but the overall approach looks sensible for the problem at hand.
Now, whether we should go to such lengths to make the debug information friendlier to a particular debugger, I am not sure, but I'll let you decide.

Random thought: instead of sinking the local values towards their first use, could we postpone emitting them until we finish selecting the block?

Cheers,
-Quentin

llvm/include/llvm/CodeGen/FastISel.h
568 ↗(On Diff #134522)

This does not return anything. Please fix the comment.

llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
125 ↗(On Diff #134522)

Could you add a message in that assert please?

160 ↗(On Diff #134522)

The comment doesn't match what the function is doing.
From the comment alone, I would have expected that we don't look at uses.

178 ↗(On Diff #134522)

@hans gave you advice on how to do the measurements.

Do you still plan to push with that option?

188 ↗(On Diff #134522)

I have mixed feelings about the approach.
I am sympathetic to the debug location concern, but taking care of avoiding spilling here is kind of sketchy.

I would rather we focus on one goal, and I believe that goal should be emitting sensible debug locations.

What I am saying is that if we want to work on the spilling case, given that every ISel has this problem (SDISel and GISel are also affected), we should probably improve the fast register allocator instead of coming up with workarounds everywhere.

This could be just a comment thing: don't mention spilling (or say that the upside is that it gives shorter live ranges, which helps RegAllocFast).

195 ↗(On Diff #134522)

Shouldn't this be MachineBasicBlock::reverse_iterator(EmitStartPt)?
We don't want to go past EmitStartPt, do we?

223 ↗(On Diff #134522)

Would be more natural to walk through the uses of DefReg.

Do we need to have the PHI uses in that list for that check to be relevant?

240 ↗(On Diff #134522)

We could postpone building that set until after the PHI check.
(The emptiness check does not require the full set to be built.)

251 ↗(On Diff #134522)

I don't really like that we are doing DCE here.
I feel that we should either leave that to the rest of the pipeline or have a more centralized place in FastISel to do it. I mean, I believe some other instructions can probably be DCE'd, so why do it here?

259 ↗(On Diff #134522)

That loop could be fairly expensive.
Assuming we have more than one local value to sink, we could number the instructions and just walk through all the users of the value and keep the one with the lowest number.

264 ↗(On Diff #134522)

I don't get why we use UsedByPhi here.
We should never insert after the first terminator, no matter what, and that is not properly checked here.

Now, whether we should go to such lengths to make the debug information friendlier to a particular debugger, I am not sure, but I'll let you decide.

Definitely not! That debugger constraint feels like it is designed for ancient compilers producing code and register allocating for one C statement at a time.

Luckily this change makes sense independently of the debugger used IMO.

rnk planned changes to this revision.Feb 27 2018, 5:27 PM
rnk marked 3 inline comments as done.

Thanks, I will implement the instruction numbering change tomorrow and then get new numbers. I'll upload a fresh patch in the meantime, though.

llvm/lib/CodeGen/SelectionDAG/FastISel.cpp
178 ↗(On Diff #134522)

I think I do. If someone complains that they find some performance regression, I want to be able to say "try flipping this flag, see if it gets faster".

188 ↗(On Diff #134522)

OK, I revised it to this:

// Try to sink local values down to their first use so that we can give them a
// better debug location. This has the side effect of shrinking local value
// live ranges, which helps out fast regalloc.

So, basically it's about shrinking live ranges. Avoiding spills is the regalloc's job, not ours.

195 ↗(On Diff #134522)

Ooh, good catch. Thank you for the careful review. :)

223 ↗(On Diff #134522)

These PHI uses do not appear in the use list because we are in the middle of ISel. We haven't formed the PHI instructions or their operand lists yet.

259 ↗(On Diff #134522)

That's a really good idea, I'll do that.
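A rough sketch of what that could look like (hypothetical helper name, not the final code):

// Sketch of the numbering idea: number the block's instructions once, cache
// the map, and pick the lowest-numbered user of DefReg instead of scanning
// the whole block for every local value we sink.
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

using InstOrderMap = DenseMap<MachineInstr *, unsigned>;

static MachineInstr *findFirstUser(MachineBasicBlock &MBB,
                                   MachineRegisterInfo &MRI, unsigned DefReg,
                                   InstOrderMap &Order) {
  if (Order.empty()) {
    // Number lazily; the same map is reused for every sunk local value.
    unsigned N = 0;
    for (MachineInstr &MI : MBB)
      Order[&MI] = N++;
  }
  MachineInstr *FirstUser = nullptr;
  unsigned FirstOrder = ~0u;
  for (MachineInstr &UseInst : MRI.use_nodbg_instructions(DefReg)) {
    if (UseInst.getParent() != &MBB)
      continue;
    unsigned O = Order.lookup(&UseInst);
    if (O < FirstOrder) {
      FirstUser = &UseInst;
      FirstOrder = O;
    }
  }
  return FirstUser; // may be null if the only uses are in not-yet-built PHIs
}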

264 ↗(On Diff #134522)

My intention was to avoid sinking across EH_LABELs when the value is used by a PHI. Stopping at terminators unconditionally makes sense, but it weakens the assertion below that we must find some use of a local value that isn't used by a PHI.

rnk updated this revision to Diff 136206.Feb 27 2018, 5:29 PM
  • rebase
  • address minor comments
  • still needs instruction numbering work
dotdash added a subscriber: dotdash.Mar 2 2018, 8:18 AM
rnk updated this revision to Diff 137100.Mar 5 2018, 3:44 PM
  • Use InstOrderMap to cache instruction orders
rnk added a comment.Mar 5 2018, 3:54 PM

I did some timings on the sqlite3 amalgamation, and I found this adds about 0.13% to compile time. I was hoping it wouldn't be measurable, but there is a repeatable 0.13% difference in the total number of instructions retired reported by the linux perf tool.

I think we should probably remove the cl::opt in this patch and then land it more or less as is. It improves the -O0 debugging experience. If speeding up X86 ISel becomes a priority at some point, I think we'll want to spend that effort working on X86 GlobalISel so that our optimizations help -O0 and -O2 codegen.

hans added a comment.Mar 6 2018, 4:59 AM

Sounds reasonable to me.

rnk updated this revision to Diff 138445.Mar 14 2018, 1:26 PM
  • rebase
rnk added a comment.Mar 14 2018, 2:40 PM

It sounds like there is consensus, so I will go ahead and push this so we can get some testing and feedback going. Let me know if there are any regressions hiding elsewhere.

This looks like what we talked about, though I've not looked any more closely than that :)

This revision was not accepted when it landed; it landed in state Needs Review.Mar 14 2018, 2:57 PM
This revision was automatically updated to reflect the committed changes.

FWIW, LGTM now.

Thanks Reid!