Page MenuHomePhabricator

Introduce bitset metadata format and bitset lowering pass.
ClosedPublic

Authored by pcc on Jan 29 2015, 6:42 PM.

Details

Summary

This patch introduces a new mechanism that allows IR modules to co-operatively
build pointer sets corresponding to addresses within a given set of
globals. One particular use case for this is to allow a C++ program to
efficiently verify (at each call site) that a vtable pointer is in the set
of valid vtable pointers for the class or its derived classes. One way of
doing this is for a toolchain component to build, for each class, a bit set
that maps to the memory region allocated for the vtables, such that each 1
bit in the bit set maps to a valid vtable for that class, and lay out the
vtables next to each other, to minimize the total size of the bit sets.

The patch introduces a metadata format for representing pointer sets, an
'@llvm.bitset.test' intrinsic and an LTO lowering pass that lays out the globals
and builds the bitsets, and documents the new feature.

Design discussion: http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/081285.html

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
silvas edited edge metadata.Jan 30 2015, 2:39 AM

My overarching concern here is extending the IR and bitcode for a feature whose utility hasn't been demonstrated. There doesn't seem to be consensus on llvmdev that this feature is worth the maintenance cost.

pcc updated this revision to Diff 19064.Jan 30 2015, 3:13 PM
pcc edited edge metadata.
  • Correctly handle multiple calls to llvm.bitset.test for a single bitset
pcc updated this revision to Diff 19281.Feb 3 2015, 4:00 PM
  • Use metadata instead of introducing the bitset global feature
  • Move bitcode format updates to separate patch
kcc added a comment.Feb 3 2015, 4:32 PM

Do you want to also change the patch summary?

pcc retitled this revision from Introduce bitset global attribute and bitset lowering pass. to Introduce bitset metadata format and bitset lowering pass..Feb 3 2015, 4:39 PM
pcc updated this object.

Do you want to also change the patch summary?

Done.

pcc added a comment.Feb 3 2015, 9:35 PM

Unfortunately I've found that this won't work if a pass (such as globaldce) calls removeDeadConstantUsers on the vtable global, as this will cause the getelementptr to be deleted and replaced with null in the metadata.

I'll try to fix this by changing the metadata format to be of the form:

{ bitset name, global, byte offset }
pcc updated this revision to Diff 19301.Feb 3 2015, 10:18 PM
  • Switch to metadata format { bitset name, global, byte offset } to avoid removeDeadConstantUsers issue
kcc added a comment.Feb 4 2015, 11:45 AM

Two comments on the doc so far.

docs/LangRef.rst
1866 ↗(On Diff #19301)

two elements... and the third is ...

1889 ↗(On Diff #19301)

I'd suggest you to have an example that has a non-zero offset (otherwise it's not clear why you need offsets)

pcc added inline comments.Feb 4 2015, 11:56 AM
docs/LangRef.rst
1866 ↗(On Diff #19301)

Done.

1889 ↗(On Diff #19301)

Done.

pcc updated this revision to Diff 19343.Feb 4 2015, 11:56 AM
  • Documentation fixes

Can you please break the docs for this into a separate document? This is what we have been doing lately for "large" new IR features, like stackmaps, statepoints and inalloca. I think this feature qualifies. The separate document format gives a bit more breathing room for discussing the design and rationale of the feature, along with providing more in-depth explanation and examples.

kcc added a comment.Feb 4 2015, 1:22 PM

+1 for a separate doc file

pcc updated this revision to Diff 19364.Feb 4 2015, 5:12 PM
  • Move documentation to separate page
kcc added inline comments.Feb 4 2015, 6:57 PM
lib/Transforms/IPO/LowerBitSets.cpp
44 ↗(On Diff #19364)

} // namespace

54 ↗(On Diff #19364)

I wonder if you could split this 300 LOC function into several smaller ones.

jfb added a comment.Feb 5 2015, 10:41 AM

I haven't looked at the big function or tests in-depth yet, agreed with @kcc that breaking up the function would make it easier to review.

Could you also add stats to the pass? I'd like Chrome to be able to have a sanity check test that this pass does something.

lib/Transforms/IPO/LowerBitSets.cpp
58 ↗(On Diff #19364)

Shouldn't it be an error to try to run the bit set test pass without a module containing the llvm.bitset.test intrinsic?

75 ↗(On Diff #19364)

What other valid uses of the intrinsic are there?

89 ↗(On Diff #19364)

I'm not sure I get what's going on here :-)

181 ↗(On Diff #19364)

UNINT64_MIN or std::numeric_limits<uint64_t>::min() is nicer.

pcc updated this revision to Diff 19425.Feb 5 2015, 1:07 PM
  • Address review comments
lib/Transforms/IPO/LowerBitSets.cpp
44 ↗(On Diff #19364)

Done.

54 ↗(On Diff #19364)

Done.

58 ↗(On Diff #19364)

This pass should work with any module, including modules that happen not to contain calls to the intrinsic.

75 ↗(On Diff #19364)

There shouldn't be any (or at least the verifier should have errored on them). I've changed this to a cast.

89 ↗(On Diff #19364)

I've added a comment which should hopefully clarify things here.

181 ↗(On Diff #19364)

(I think you mean ::max) Done.

pcc updated this revision to Diff 19426.Feb 5 2015, 1:11 PM
  • s/constants/metadata/
pcc updated this revision to Diff 19427.Feb 5 2015, 1:19 PM
  • Add statistics
kcc added a comment.Feb 5 2015, 3:50 PM

Few comments in random places, more to go

lib/Transforms/IPO/LowerBitSets.cpp
10 ↗(On Diff #19427)

are we going to handle InvokeInst?
Check llvm/IR/CallSite.h

55 ↗(On Diff #19427)

SmallVector maybe

106 ↗(On Diff #19427)

I'd prefer if we report_fatal_error w/o DL, just like in AddressSanitizer.cpp

kcc added a comment.Feb 5 2015, 3:51 PM

Few comments in random places, more to go

kcc added inline comments.Feb 5 2015, 4:10 PM
lib/Transforms/IPO/LowerBitSets.cpp
115 ↗(On Diff #19427)

I wonder if this thing is unit-testable separately?
We are not doing these things frequently, but I think we should.
E.g. see ASanStackFrameLayoutTest.cpp

121 ↗(On Diff #19427)

SmallVector, here and maybe in other places.

210 ↗(On Diff #19427)

Clever!

222 ↗(On Diff #19427)

Do we really want a separate BB here?
On valid programs, i.e. in 99.9999% cases both checks will pass, so it might be better to do this w/o branches.

pcc updated this revision to Diff 19506.Feb 6 2015, 12:42 PM

This also introduces a couple of optimizations which yielded a binary size
improvement of about 0.5% in a certain medium size (~14MB) binary.

  • Enforce availability of data layout
  • Check if the pointer is statically a member of the bitset; if so, remove the check
  • Add fast check for case where bitset contains single element
  • Use SmallVector
pcc added inline comments.Feb 6 2015, 12:44 PM
lib/Transforms/IPO/LowerBitSets.cpp
10 ↗(On Diff #19427)

It isn't normally possible to invoke an intrinsic -- I think this is only possible for two specific intrinsics. In any case, this intrinsic cannot unwind.

55 ↗(On Diff #19427)

I don't think we can make any prediction one way or another how large these are going to get. For vptr CFI it depends on the number of virtual calls made using a particular class, which is essentially unbounded. In the absence of any evidence I'd prefer to leave this as a std::vector.

106 ↗(On Diff #19427)

Done.

115 ↗(On Diff #19427)

Most likely. I think this already has enough test coverage for the moment though.

121 ↗(On Diff #19427)

Done. This is probably better as a SmallVector since it lives on the stack and we can probably make an educated guess on its size.

222 ↗(On Diff #19427)

It seems better to have separate BBs for the llvm.bitset.test intrinsic at least. Without the separate BBs, we could segv on out-of-bounds addresses, and it's possible that the caller might always need the boolean result (say if it wanted to print a diagnostic message if the check fails). If we do introduce intrinsics that are required to abort the program if the address is out of bounds, we could probably do both checks in the same BB, as either way the program would abort (but it would be harder to distinguish CFI failures from other types of crashes).

kcc added inline comments.Feb 6 2015, 3:11 PM
lib/Transforms/IPO/LowerBitSets.cpp
168 ↗(On Diff #19506)

do you need M here?

197 ↗(On Diff #19506)

I'd split this function into two at this point.
Then the second part of the function will become easily unit-testable.

201 ↗(On Diff #19506)

This part is the key to understanding what we are doing here.
It would be nice to have examples of input (array of offsets) and output (bitset).
But then, a unit tests would be a better example than comments.
I do not agree with you here that we have enough test coverage and hence do not need more tests.
unit tests are not necessarily required only for test coverage -- they are also a good way to provide examples.

293 ↗(On Diff #19506)

the code gen seems to generate the BT instruction sometimes.

It's may be better to just create an instruction sequence that will be lowered to BT.
Example:

void foo(unsigned char a, unsigned long b) {

if (a & (1 << b))
  __builtin_trap();

}

0:	0f a3 f7             	bt     %esi,%edi
3:	72 01                	jb     6 <_Z3foohm+0x6>
5:	c3                   	retq   
6:	0f 0b                	ud2
pcc added inline comments.Feb 6 2015, 3:26 PM
lib/Transforms/IPO/LowerBitSets.cpp
168 ↗(On Diff #19506)

No, I'll remove it.

197 ↗(On Diff #19506)

Makes sense, I'll do that.

201 ↗(On Diff #19506)

Fair enough, I'll see if I can add a few unit tests.

293 ↗(On Diff #19506)

Yes, we can do that. I did try before to emit the register variant of BT but I guess I wasn't using the correct pattern. I'll play around with the IR to see if I can get it to match.

I was also thinking more of the memory variant of BT that basically does exactly what we want given a bit offset.

pcc updated this revision to Diff 19616.Feb 9 2015, 2:33 PM
  • Use register variant of bt instruction on x86
  • Move bitset lowering pass before simplifycfg
  • If the bit set is sufficiently small, we can avoid a load by bit testing a constant
  • Update test case
  • Move part of implementation to testable location
  • Add unit tests for BitSetBuilder
  • Remove M parameter
pcc added a comment.Feb 9 2015, 2:34 PM

Also implemented an optimization for bit sets of size <= 64 bits.

lib/Transforms/IPO/LowerBitSets.cpp
168 ↗(On Diff #19506)

Done.

197 ↗(On Diff #19506)

Done.

201 ↗(On Diff #19506)

Done.

293 ↗(On Diff #19506)

This now uses the register BT.

pcc updated this revision to Diff 19849.Feb 12 2015, 11:33 AM
  • Update comment with benchmark results for memory bt
kcc added inline comments.Feb 18 2015, 2:49 PM
lib/Transforms/IPO/LowerBitSets.cpp
175 ↗(On Diff #19849)

You can get all of these, except for IntPtrTy, from the IRBuilder. Just FYI, this way is also fine, I think.

189 ↗(On Diff #19849)

I would add comments for every non-trivial function.
In most cases, I understand what these things are doing, other readers (including myself in 6 months) will probably be puzzled.

292 ↗(On Diff #19849)

split into a separate function?

pcc updated this revision to Diff 20252.Feb 18 2015, 6:44 PM
  • Remove Int8Ty field.
  • Add function comments
  • Factor out some code into a function
lib/Transforms/IPO/LowerBitSets.cpp
189 ↗(On Diff #19849)

Done

292 ↗(On Diff #19849)

Done

jfb added inline comments.Feb 19 2015, 10:26 AM
lib/Transforms/IPO/LowerBitSets.cpp
42 ↗(On Diff #20252)

1 should technically be a uint64_t too, though if the current code triggered UB things would be awful in other ways!

236 ↗(On Diff #20252)

I'm mildly disappointed that the optimizer doesn't do this by taking into account ISA-specific sizes (and then removing the dead global because its address isn't taken).

pcc updated this revision to Diff 20321.Feb 19 2015, 11:22 AM
  • Use a 64-bit shift, add unit tests for containsGlobalOffset
lib/Transforms/IPO/LowerBitSets.cpp
42 ↗(On Diff #20252)

Thanks, fixed and added unit tests for this function.

236 ↗(On Diff #20252)

This should in principle be possible, but this pass runs late so in any case it seems best to directly generate the IR we need.

jfb added inline comments.Feb 19 2015, 12:51 PM
lib/Transforms/IPO/LowerBitSets.cpp
236 ↗(On Diff #20252)

Does it need to run that late? 8 may not be the right number on all architectures, and you should see a good binary size reduction by GC'ing unreferenced globals (especially if CFI+devirtualization occurs).

pcc added inline comments.Feb 19 2015, 1:45 PM
lib/Transforms/IPO/LowerBitSets.cpp
236 ↗(On Diff #20252)

It seemed better to run the pass later because it splits basic blocks, which could pessimize things in other passes, and bitset creation essentially locks in the set of virtual tables that appear in the binary, preventing them from being GCd.

I'm not sure what the best way to deal with this might be. We might later want to split this pass into an early pass and a late pass, where the early pass uses bitset metadata to do devirtualization, a later globalopt pass GCs unused vtables and the late pass builds the actual bitsets.

jfb added inline comments.Feb 19 2015, 1:59 PM
lib/Transforms/IPO/LowerBitSets.cpp
236 ↗(On Diff #20252)

Maybe I have a naive view, but shouldn't bitsets be GC'able if no test refers to them? This doesn't need to be an optimization that's specific to your code, LLVM can do this in general when a global doesn't escape and isn't address-taken (and in your case, is read-only). If this is correct, then I don't think you need to split up this pass, though I agree that you may want to do devirtualization earlier to expose more optimization opportunities.

Under the current setup, do redundant tests in the same function get eliminated and control flow merged?

This may be something that we can leave open for later changes: I think the current code is good in that it does what's required and is pretty efficient at it. I don't think the design will change substantially, but I do think there are further optimization opportunities here. WDYT?

pcc added inline comments.Feb 19 2015, 2:59 PM
lib/Transforms/IPO/LowerBitSets.cpp
236 ↗(On Diff #20252)

Maybe I have a naive view, but shouldn't bitsets be GC'able if no test refers to them?

They could be, but the globals that the bitsets map onto (i.e. the vtables) cannot be GC'd because we lay them out in a specific order in this pass.

We only build bitset constants for bitsets that are referred to by tests. The loop near the start of LowerBitSets::buildBitSets identifies all such bitsets by looking through the uses of the llvm.bitset.test intrinsic. If a particular test is dead, LLVM should equally be able to remove the dead test (as it is a readonly intrinsic) or remove a dead load from a bitset as part of DCE (in fact the former would probably be easier because of the simpler control flow).

Another advantage of doing this late is that allowing the earlier passes to eliminate dead tests we potentially reduce the number of equivalence classes we need to create, which could result in smaller disjoint sets of classes and therefore smaller bitsets.

Under the current setup, do redundant tests in the same function get eliminated and control flow merged?

Are you referring to cases where a virtual call happens twice through the same pointer?

struct S {
  virtual void f();
};

[...]
S *p = ...;
p->f();
p->f();

The problem is that it will be difficult to remove redundant tests because of the semantics of C++. In this case the function f could overwrite the memory region that p refers to with an object of a different derived class without invoking undefined behavior. We might want a flag that a user can use to promise that such things will never happen though.

I don't think the design will change substantially, but I do think there are further optimization opportunities here. WDYT?

At a high level I do agree that there are optimization opportunities to pursue here. (I could elaborate, but this probably isn't the place.)

jfb added a comment.Feb 19 2015, 3:16 PM

================
Comment at: lib/Transforms/IPO/LowerBitSets.cpp:236
@@ +235,3 @@
+ Value *BitOffset) {
+ if (BSI.Bits.size() <= 8) {
+ // If the bit set is sufficiently small, we can avoid a load by bit
testing


> Maybe I have a naive view, but shouldn't bitsets be GC'able if no test
refers to them? This doesn't need to be an optimization that's specific to
your code, LLVM can do this in general when a global doesn't escape and
isn't address-taken (and in your case, is read-only). If this is correct,
then I don't think you need to split up this pass, though I agree that you
may want to do devirtualization earlier to expose more optimization
opportunities.
>
> Under the current setup, do redundant tests in the same function get
eliminated and control flow merged?
>
> This may be something that we can leave open for later changes: I think
the current code is good in that it does what's required and is pretty
efficient at it. I don't think the design will change substantially, but I
do think there are further optimization opportunities here. WDYT?
> Maybe I have a naive view, but shouldn't bitsets be GC'able if no test
refers to them?

They could be, but the globals that the bitsets map onto (i.e. the
vtables) cannot be GC'd because we lay them out in a specific order in this
pass.

We only build bitset constants for bitsets that are referred to by tests.
The loop near the start of LowerBitSets::buildBitSets identifies all such
bitsets by looking through the uses of the llvm.bitset.test intrinsic. If
a particular test is dead, LLVM should equally be able to remove the dead
test (as it is a readonly intrinsic) or remove a dead load from a bitset as
part of DCE (in fact the former would probably be easier because of the
simpler control flow).

OK, that's what I was hoping happens (I was afraid the optimization was
running too early to be able to do this).

Another advantage of doing this late is that allowing the earlier passes to

eliminate dead tests we potentially reduce the number of equivalence
classes we need to create, which could result in smaller disjoint sets of
classes and therefore smaller bitsets.

Agreed.

Under the current setup, do redundant tests in the same function get
eliminated and control flow merged?

Are you referring to cases where a virtual call happens twice through the
same pointer?

Yes, or any case where the test intrinsic is redundant (doesn't have to be
the same f).

The problem is that it will be difficult to remove redundant tests because

of the semantics of C++. In this case the function f could overwrite the
memory region that p refers to with an object of a different derived class
without invoking undefined behavior. We might want a flag that a user can
use to promise that such things will never happen though.

Good point, I hadn't thought through that. It may be worth adding to the
design doc?

I don't think the design will change substantially, but I do think there
are further optimization opportunities here. WDYT?

At a high level I do agree that there are optimization opportunities to
pursue here. (I could elaborate, but this probably isn't the place.)

OK, overall this lgtm, I think we're on the same page w.r.t. potential
optimizations.

kcc added a comment.Feb 19 2015, 3:31 PM

LGTM

I think at this point it will become easier to continue reviews/improvements if the code is committed.
Let's give one more day to other reviewers -- if you don't hear significant objections please commit tomorrow.

lib/Transforms/IPO/LowerBitSets.cpp
362 ↗(On Diff #20321)

remove {}

This revision was automatically updated to reflect the committed changes.