This is an archive of the discontinued LLVM Phabricator instance.

LowerBitSets: Use byte arrays instead of bit sets to represent in-memory bit sets.
ClosedPublic

Authored by pcc on Feb 27 2015, 12:30 PM.

Details

Summary

By loading from indexed offsets into a byte array and applying a mask, a
program can test bits from the bit set with a relatively short instruction
sequence. For example, suppose we have 15 bit sets to lay out:

A (16 bits), B (15 bits), C (14 bits), D (13 bits), E (12 bits),
F (11 bits), G (10 bits), H (9 bits), I (7 bits), J (6 bits), K (5 bits),
L (4 bits), M (3 bits), N (2 bits), O (1 bit)

These bits can be laid out in a 16-byte array like this:

  Byte Offset
0123456789ABCDEF

Bit

7 HHHHHHHHHIIIIIII
6 GGGGGGGGGGJJJJJJ
5 FFFFFFFFFFFKKKKK
4 EEEEEEEEEEEELLLL
3 DDDDDDDDDDDDDMMM
2 CCCCCCCCCCCCCCNN
1 BBBBBBBBBBBBBBBO
0 AAAAAAAAAAAAAAAA

For example, to test bit X of A, we evaluate ((bits[X] & 1) != 0), or to
test bit X of I, we evaluate ((bits[9 + X] & 0x80) != 0). This can be done
in 1-2 machine instructions on x86, or 4-6 instructions on ARM.

This uses the LPT multiprocessor scheduling algorithm to lay out the bits
efficiently.

Saves ~450KB of instructions in a recent build of Chromium.

Diff Detail

Event Timeline

pcc updated this revision to Diff 20883.Feb 27 2015, 12:30 PM
pcc retitled this revision from to LowerBitSets: Use byte arrays instead of bit sets to represent in-memory bit sets..
pcc updated this object.
pcc edited the test plan for this revision. (Show Details)
pcc added reviewers: kcc, jfb.
pcc added a subscriber: Unknown Object (MLST).
kcc edited edge metadata.Mar 2 2015, 1:19 PM

Saves ~450KB of instructions in a recent build of Chromium.

Out of how many?

kcc added a comment.Mar 2 2015, 1:24 PM

Saves ~450KB of instructions in a recent build of Chromium.

Out of how many?

pcc added a comment.Mar 2 2015, 1:37 PM

The current (i.e. after this patch) .text overhead is around 6.2MB, so around 7%.

kcc added inline comments.Mar 2 2015, 1:51 PM
include/llvm/Transforms/IPO/LowerBitSets.h
34

Is this better than unordered_set?
(I don't know, just asking)

183

LPT == longest process time?
Maybe mention the full name here?

lib/Transforms/IPO/LowerBitSets.cpp
34–36

It would be interesting to see more details stats here,
e.g. how many bits were packed into how many bytes.

jfb added inline comments.Mar 2 2015, 2:34 PM
include/llvm/Transforms/IPO/LowerBitSets.h
174

Could you add a symbolic constant for 8? Also, why 8? Is it good for x86 but now as good for ARM?

178

memset or std::fill.

lib/Transforms/IPO/LowerBitSets.cpp
34–36

+1

How many bits are unused (didn't fit) would be interesting.

154

Update 8.

300

Capitalize RAUW.

367

Just to be sure I understand: this inlines small (<= 64 bit) bitsets. The current patch only benefits from co-locating larger bitsets, right? It would probably be good for the docs to specify this double approach to reducing bitset footprint.

pcc updated this revision to Diff 21052.Mar 2 2015, 3:34 PM
pcc edited edge metadata.

Address reviewer comments.

include/llvm/Transforms/IPO/LowerBitSets.h
34

Maybe, not sure though. DenseSet may be even better here. But I'd rather wait until this pass isn't a small fraction of the overall compile time before trying to optimize our data structures.

174

Are you asking why I am using 8 bits per array element? Well, for one thing it gives us better packing (the more bins there are, the less evenly they will be filled). For another, the instruction sequences can be slightly shorter, both on x86 and ARM. For example, this code:

typedef unsigned long T;
typedef void (*F)(void);
T table;

unsigned char bitset[1000];
void FOO(T **obj) {
  T *vptr = *obj;
  T ptr_offset = (T)vptr - (T)table;
  T t1 = ptr_offset << (sizeof(void*)*8 - 3);
  T t2 = ptr_offset >> 3;
  T t3 = t1 | t2;
  unsigned bits = bitset[t3];
  if ((t3 >= TABLE_SIZE) || (bits & 1))
    __builtin_trap();
  F f = (F)vptr[4];
  f();
}

is one byte shorter (on x86) or one instruction shorter (on ARM) than this code:

typedef unsigned long T;
typedef void (*F)(void);
T table;

unsigned short bitset[1000];
void FOO(T **obj) {
  T *vptr = *obj;
  T ptr_offset = (T)vptr - (T)table;
  T t1 = ptr_offset << (sizeof(void*)*8 - 3);
  T t2 = ptr_offset >> 3;
  T t3 = t1 | t2;
  unsigned bits = bitset[t3];
  if ((t3 >= TABLE_SIZE) || (bits & 0x100))
    __builtin_trap();
  F f = (F)vptr[4];
  f();
}

I guess I could parameterise this but it seems like it would add more complexity than we need. Happy to document the choice of 8 bits per array element.

178

Done

183

Done

lib/Transforms/IPO/LowerBitSets.cpp
34–36

I've added some more stats here.

154

See my other comment.

300

Done

367

We already document the technique for inlining <=64 bits:

http://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#short-inline-bit-vectors

and I do plan to document byte arrays on that page once this lands.

jfb added inline comments.Mar 2 2015, 3:43 PM
include/llvm/Transforms/IPO/LowerBitSets.h
174

Yes, documentation for the choice of 8 is good :)

lib/Transforms/IPO/LowerBitSets.cpp
367

sgtm

pcc updated this revision to Diff 21059.Mar 2 2015, 4:07 PM
  • Document choice of 1 byte per array element
include/llvm/Transforms/IPO/LowerBitSets.h
174

Done

jfb added inline comments.Mar 2 2015, 4:10 PM
include/llvm/Transforms/IPO/LowerBitSets.h
179

I'd still rather have to 8 pulled out as its own symbolic constant, instead of being used here and below as a magic number.

pcc updated this revision to Diff 21063.Mar 2 2015, 4:34 PM
  • Use a constant for the number of bits per byte
include/llvm/Transforms/IPO/LowerBitSets.h
179

done

jfb accepted this revision.Mar 2 2015, 4:38 PM
jfb edited edge metadata.

lgtm

This revision is now accepted and ready to land.Mar 2 2015, 4:38 PM
This revision was automatically updated to reflect the committed changes.