This is an archive of the discontinued LLVM Phabricator instance.

LowerBitSets: Use byte arrays instead of bit sets to represent in-memory bit sets.
ClosedPublic

Authored by pcc on Feb 27 2015, 12:30 PM.

Details

Summary

By loading from indexed offsets into a byte array and applying a mask, a
program can test bits from the bit set with a relatively short instruction
sequence. For example, suppose we have 15 bit sets to lay out:

A (16 bits), B (15 bits), C (14 bits), D (13 bits), E (12 bits),
F (11 bits), G (10 bits), H (9 bits), I (7 bits), J (6 bits), K (5 bits),
L (4 bits), M (3 bits), N (2 bits), O (1 bit)

These bits can be laid out in a 16-byte array like this:

  Byte Offset
0123456789ABCDEF

Bit

7 HHHHHHHHHIIIIIII
6 GGGGGGGGGGJJJJJJ
5 FFFFFFFFFFFKKKKK
4 EEEEEEEEEEEELLLL
3 DDDDDDDDDDDDDMMM
2 CCCCCCCCCCCCCCNN
1 BBBBBBBBBBBBBBBO
0 AAAAAAAAAAAAAAAA

For example, to test bit X of A, we evaluate ((bits[X] & 1) != 0), or to
test bit X of I, we evaluate ((bits[9 + X] & 0x80) != 0). This can be done
in 1-2 machine instructions on x86, or 4-6 instructions on ARM.

This uses the LPT multiprocessor scheduling algorithm to lay out the bits
efficiently.

Saves ~450KB of instructions in a recent build of Chromium.

Diff Detail

Repository
rL LLVM

Event Timeline

pcc updated this revision to Diff 20883.Feb 27 2015, 12:30 PM
pcc retitled this revision from to LowerBitSets: Use byte arrays instead of bit sets to represent in-memory bit sets..
pcc updated this object.
pcc edited the test plan for this revision. (Show Details)
pcc added reviewers: kcc, jfb.
pcc added a subscriber: Unknown Object (MLST).
kcc edited edge metadata.Mar 2 2015, 1:19 PM

Saves ~450KB of instructions in a recent build of Chromium.

Out of how many?

kcc added a comment.Mar 2 2015, 1:24 PM

Saves ~450KB of instructions in a recent build of Chromium.

Out of how many?

pcc added a comment.Mar 2 2015, 1:37 PM

The current (i.e. after this patch) .text overhead is around 6.2MB, so around 7%.

kcc added inline comments.Mar 2 2015, 1:51 PM
include/llvm/Transforms/IPO/LowerBitSets.h
34 ↗(On Diff #20883)

Is this better than unordered_set?
(I don't know, just asking)

183 ↗(On Diff #20883)

LPT == longest process time?
Maybe mention the full name here?

lib/Transforms/IPO/LowerBitSets.cpp
34 ↗(On Diff #20883)

It would be interesting to see more details stats here,
e.g. how many bits were packed into how many bytes.

jfb added inline comments.Mar 2 2015, 2:34 PM
include/llvm/Transforms/IPO/LowerBitSets.h
174 ↗(On Diff #20883)

Could you add a symbolic constant for 8? Also, why 8? Is it good for x86 but now as good for ARM?

178 ↗(On Diff #20883)

memset or std::fill.

lib/Transforms/IPO/LowerBitSets.cpp
34 ↗(On Diff #20883)

+1

How many bits are unused (didn't fit) would be interesting.

152 ↗(On Diff #20883)

Update 8.

298 ↗(On Diff #20883)

Capitalize RAUW.

360 ↗(On Diff #20883)

Just to be sure I understand: this inlines small (<= 64 bit) bitsets. The current patch only benefits from co-locating larger bitsets, right? It would probably be good for the docs to specify this double approach to reducing bitset footprint.

pcc updated this revision to Diff 21052.Mar 2 2015, 3:34 PM
pcc edited edge metadata.

Address reviewer comments.

include/llvm/Transforms/IPO/LowerBitSets.h
34 ↗(On Diff #20883)

Maybe, not sure though. DenseSet may be even better here. But I'd rather wait until this pass isn't a small fraction of the overall compile time before trying to optimize our data structures.

174 ↗(On Diff #20883)

Are you asking why I am using 8 bits per array element? Well, for one thing it gives us better packing (the more bins there are, the less evenly they will be filled). For another, the instruction sequences can be slightly shorter, both on x86 and ARM. For example, this code:

typedef unsigned long T;
typedef void (*F)(void);
T table;

unsigned char bitset[1000];
void FOO(T **obj) {
  T *vptr = *obj;
  T ptr_offset = (T)vptr - (T)table;
  T t1 = ptr_offset << (sizeof(void*)*8 - 3);
  T t2 = ptr_offset >> 3;
  T t3 = t1 | t2;
  unsigned bits = bitset[t3];
  if ((t3 >= TABLE_SIZE) || (bits & 1))
    __builtin_trap();
  F f = (F)vptr[4];
  f();
}

is one byte shorter (on x86) or one instruction shorter (on ARM) than this code:

typedef unsigned long T;
typedef void (*F)(void);
T table;

unsigned short bitset[1000];
void FOO(T **obj) {
  T *vptr = *obj;
  T ptr_offset = (T)vptr - (T)table;
  T t1 = ptr_offset << (sizeof(void*)*8 - 3);
  T t2 = ptr_offset >> 3;
  T t3 = t1 | t2;
  unsigned bits = bitset[t3];
  if ((t3 >= TABLE_SIZE) || (bits & 0x100))
    __builtin_trap();
  F f = (F)vptr[4];
  f();
}

I guess I could parameterise this but it seems like it would add more complexity than we need. Happy to document the choice of 8 bits per array element.

178 ↗(On Diff #20883)

Done

183 ↗(On Diff #20883)

Done

lib/Transforms/IPO/LowerBitSets.cpp
34 ↗(On Diff #20883)

I've added some more stats here.

152 ↗(On Diff #20883)

See my other comment.

298 ↗(On Diff #20883)

Done

360 ↗(On Diff #20883)

We already document the technique for inlining <=64 bits:

http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#short-inline-bit-vectors

and I do plan to document byte arrays on that page once this lands.

jfb added inline comments.Mar 2 2015, 3:43 PM
include/llvm/Transforms/IPO/LowerBitSets.h
174 ↗(On Diff #20883)

Yes, documentation for the choice of 8 is good :)

lib/Transforms/IPO/LowerBitSets.cpp
360 ↗(On Diff #20883)

sgtm

pcc updated this revision to Diff 21059.Mar 2 2015, 4:07 PM
  • Document choice of 1 byte per array element
include/llvm/Transforms/IPO/LowerBitSets.h
174 ↗(On Diff #20883)

Done

jfb added inline comments.Mar 2 2015, 4:10 PM
include/llvm/Transforms/IPO/LowerBitSets.h
179 ↗(On Diff #21059)

I'd still rather have to 8 pulled out as its own symbolic constant, instead of being used here and below as a magic number.

pcc updated this revision to Diff 21063.Mar 2 2015, 4:34 PM
  • Use a constant for the number of bits per byte
include/llvm/Transforms/IPO/LowerBitSets.h
179 ↗(On Diff #21059)

done

jfb accepted this revision.Mar 2 2015, 4:38 PM
jfb edited edge metadata.

lgtm

This revision is now accepted and ready to land.Mar 2 2015, 4:38 PM
This revision was automatically updated to reflect the committed changes.