This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Place SectionPiece::{Live,Hash} bit fields together
ClosedPublic

Authored by MaskRay on Apr 16 2019, 2:15 AM.

Details

Summary

We read Live and write to OutputOff simultaneously in 3 places.
Separating them avoids data sharing and data races like D41884/PR35788.
This patch places Live and Hash together.

2 reasons this is appealing:

  1. Hash is immutable. Live is almost read-only - only written once in MarkLive.cpp where Hash is not accessed
  2. we already discard low bits of Hash to decide ShardID. It doesn't matter much if we discard another bit.

Because all the use sites of OutputOff expect uint64_t/size, change its
type from int64_t to uint64_t.

Event Timeline

MaskRay created this revision.Apr 16 2019, 2:15 AM
ruiu added a comment.Apr 16 2019, 2:37 AM

This change seems fine as long as it doesn't cause any performance regression. Did you run lld with this change to see if the increase of hash collision doesn't have a negative impact?

I'm trying to add another field to the struct which is likely to conflict with this change.

So, we have 128 bits for this struct. 32 bits are used for the input offset. I want to pack the following bits to the remaining 96 bits.

  • 32 bit hash value
  • 1 bit for liveness
  • 40 bits for the output offset (should be large enough)
  • 5 bits for tail-merge shard ID

The sum is 78 bits, so we have enough bits. The problem is that accessing these bitfields are not thread-safe. So, how about defining these fields as follows

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

and then provide a few accessor functions, namely getOutputOff and setOutputOff, to this struct? Then accesses to this class become thread-safe.

Did you run lld with this change to see if the increase of hash collision doesn't have a negative impact?

This patch has improvement, likely due to 2938: if (!Sec->Pieces[I].Live) return;

Before: 19.871 +- 0.149 seconds time elapsed ( +- 0.75% )
After: 19.207 +- 0.118 seconds time elapsed ( +- 0.61% )

I'm trying to add another field to the struct which is likely to conflict with this change.

Do you mean D38528? I haven't read that. I'll check that and think about your comments carefully tomorrow.

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

Do you mean

// 16 bytes -> 12 bytes
uint32_t InputOff;

uint32_t Live : 1
uint32_t Hash : 18;
uint32_t TailShardId : 5;
uint32_t OutputOffHi : 8;

uint32_t OutputOffLo;

Looks fine.

ruiu added a comment.Apr 17 2019, 12:38 AM
uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

Do you mean

// 16 bytes -> 12 bytes
uint32_t InputOff;

uint32_t Live : 1
uint32_t Hash : 18;
uint32_t TailShardId : 5;
uint32_t OutputOffHi : 8;

uint32_t OutputOffLo;

Looks fine.

Reducing the size to 12 bytes looks like a good idea, but I'm worried if 18 bits hash may be too short. This hash value will be used to determine a bucket in a hash table, so reducing the number of hash bits means that we'll likely to have more collisions.

MaskRay added a comment.EditedApr 17 2019, 12:51 AM
uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

Do you mean

// 16 bytes -> 12 bytes
uint32_t InputOff;

uint32_t Live : 1
uint32_t Hash : 18;
uint32_t TailShardId : 5;
uint32_t OutputOffHi : 8;

uint32_t OutputOffLo;

Looks fine.

Reducing the size to 12 bytes looks like a good idea, but I'm worried if 18 bits hash may be too short. This hash value will be used to determine a bucket in a hash table, so reducing the number of hash bits means that we'll likely to have more collisions.

It was my interpretation of your previous comment. I thought you wanted to do that (Now I think I don't understand what you were getting at):

40 bits for the output offset (should be large enough)

I am happy with the current 16-byte SectionPiece. Making OutputOffHi part of either Hash or Live can have data sharing caveats.

ruiu added a comment.Apr 17 2019, 1:00 AM

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, as they are now in different non-integer-bitmap struct members.

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, as they are now in different non-integer-bitmap struct members.

I don't understand the split of uint64_t OutputOff -> uint8_t OutputOffHi+uint32_t OutputOffLo (40 bits).

uint32_t InputOff;

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;  // this may be written concurrently with the read of `Live` -> race

uint32_t OutputOffLo;
ruiu added a comment.Apr 17 2019, 1:44 AM

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, as they are now in different non-integer-bitmap struct members.

I don't understand the split of uint64_t OutputOff -> uint8_t OutputOffHi+uint32_t OutputOffLo (40 bits).

uint32_t InputOff;

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;  // this may be written concurrently with the read of `Live` -> race

uint32_t OutputOffLo;

Does that really race? OutputOffHi and Live are in different uint8_t in the above struct, so a write to OutputOffHi doesn't race with a read of Live, no?

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, as they are now in different non-integer-bitmap struct members.

I don't understand the split of uint64_t OutputOff -> uint8_t OutputOffHi+uint32_t OutputOffLo (40 bits).

uint32_t InputOff;

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;  // this may be written concurrently with the read of `Live` -> race

uint32_t OutputOffLo;

Does that really race? OutputOffHi and Live are in different uint8_t in the above struct, so a write to OutputOffHi doesn't race with a read of Live, no?

It doesn't race. Since OutputOffHi is not a bitfield, it takes a different memory location. Concurrent accesses to different memory locations don't race. I still don't understand the split of OutputOff. If you pack other members adjacent to OutputOffHi and read them, it may race with the write of OutputOffHi.

ruiu added a comment.Apr 17 2019, 10:32 PM

It doesn't race. Since OutputOffHi is not a bitfield, it takes a different memory location. Concurrent accesses to different memory locations don't race. I still don't understand the split of OutputOff. If you pack other members adjacent to OutputOffHi and read them, it may race with the write of OutputOffHi.

I don't know if I understand what you said correctly, so let me write down my understanding.

  • The main purpose of this patch is to make concurrent accesses to OutputOff and other members safe.
  • The OuptutOff member currently races with Live member because they are bitfields of the same memory location.
  • In order to make OutputOff race-free, we should make OutputOff a non-bitfield member.

Are we still on the same page? If so, please look at the following class.

class SectionPiece {
public:
  SectionPiece(size_t Off, uint32_t Hash, bool Live)
      : InputOff(Off), Live(Live || !Config->GcSections), Hash(Hash >> 1) {}

  uint64_t setOutputOff(uint64_t Val) {
    OutputOffHi = Val >> 32;
    OutputOffLo = Val;
  }

  uint64_t getOutputOff() const {
    return (uint64_t(OutputOffHi) << 32) | OutputOffLo;
  }

  uint32_t InputOff;
  uint32_t Hash;
  uint8_t TailHash;
  uint8_t Live;

private:
  uint8_t OutputOffHi
  uint32_t OutputOffLo;
};

Notice that {get,set}OutputOff don't race with accesses to InputOff/Hash/TailHash/Live. So the goal has achieved. Also with this scheme we don't sacrifice the bits of Hash -- Hash member still has the full 32 bits.

MaskRay added a comment.EditedApr 18 2019, 12:18 AM

Notice that {get,set}OutputOff don't race with accesses to InputOff/Hash/TailHash/Live. So the goal has achieved. Also with this scheme we don't sacrifice the bits of Hash -- Hash member still has the full 32 bits.

I get your idea now: splitting OutputOff to make space for Hash (to keep it 32 bits). However, the benchmark shows 31-bit Hash works just fine.

For a huge internal executable (1.6GiB clang -O3), Strings in StringTableBuilder::finalizeStringTable contains at most 310253 elements.
Every pair has a probability 2^(-31) of colliding. The expected number of pair-wise collisions is 2^(-31) * C(310253,2) ~= 22.41. Note, this number is pair-wise - if 5 elements hash to the same value, they count as C(5,2) collisions. Assume every but one bucket has at most 1 element, that bucket with collision has at most 7 elements => The degraded performance is nearly nothing.

So for simplicity, I prefer leaving OutputOff as is.

ruiu accepted this revision.Apr 18 2019, 12:24 AM

LGTM

Notice that {get,set}OutputOff don't race with accesses to InputOff/Hash/TailHash/Live. So the goal has achieved. Also with this scheme we don't sacrifice the bits of Hash -- Hash member still has the full 32 bits.

I get your idea now: splitting OutputOff to make space for Hash (to keep it 32 bits). However, the benchmark shows 31-bit Hash works just fine.

For a huge internal executable (1.6GiB clang -O3), Strings in StringTableBuilder::finalizeStringTable contains at most 310253 elements.
Every pair has a probability 2^(-31) of colliding. The expected number of pair-wise collisions is 2^(-31) * C(310253,2) ~= 22.41. Note, this number is pair-wise - if 5 elements hash to the same value, they count as C(5,2) collisions. Assume every but one bucket has at most 1 element, that bucket with collision has at most 7 elements => The degraded performance is nearly nothing.

So for simplicity, I prefer leaving OutputOff as is.

Fair. Thank you for the numbers.

This revision is now accepted and ready to land.Apr 18 2019, 12:24 AM
This revision was automatically updated to reflect the committed changes.
ruiu added a comment.Apr 18 2019, 12:53 AM

By the way I think your finding that this struct can be 12 bytes long instead of 16 bytes long is pretty interesting. Previously, we found that making this struct as small as possible does matter in terms of performance perhaps due to memory locality. So, shaving off 4 bytes from this might make a noticeable difference in speed. Do you want to try? (If not I'll do that sometime in the future.)

By the way I think your finding that this struct can be 12 bytes long instead of 16 bytes long is pretty interesting. Previously, we found that making this struct as small as possible does matter in terms of performance perhaps due to memory locality. So, shaving off 4 bytes from this might make a noticeable difference in speed. Do you want to try? (If not I'll do that sometime in the future.)

I tested it https://reviews.llvm.org/differential/diff/195689/ Making SectionPiece 12 bytes has some negative impact on the performance (likely due to extra OutputSec packing/unpacking instructions). So it isn't worth to do that.

Before: 17.177 +- 0.133 seconds time elapsed ( +- 0.77% )
After: 17.279 +- 0.171 seconds time elapsed ( +- 0.99% )

ruiu added a comment.Apr 18 2019, 2:15 AM

By the way I think your finding that this struct can be 12 bytes long instead of 16 bytes long is pretty interesting. Previously, we found that making this struct as small as possible does matter in terms of performance perhaps due to memory locality. So, shaving off 4 bytes from this might make a noticeable difference in speed. Do you want to try? (If not I'll do that sometime in the future.)

I tested it https://reviews.llvm.org/differential/diff/195689/ Making SectionPiece 12 bytes has some negative impact on the performance (likely due to extra OutputSec packing/unpacking instructions). So it isn't worth to do that.

Before: 17.177 +- 0.133 seconds time elapsed ( +- 0.77% )
After: 17.279 +- 0.171 seconds time elapsed ( +- 0.99% )

Thank you for measuring!

The numbers are interesting. That's somewhat counter-intuitive, but perhaps a reduced hash size did a bad thing? I'd think that 34 bits should be enough for OutputOff though.

MaskRay added a comment.EditedApr 18 2019, 2:56 AM

That's somewhat counter-intuitive, but perhaps a reduced hash size did a bad thing? I'd think that 34 bits should be enough for OutputOff though.

The class has false sharing problems.

My perf stat results varied. I think the difference was noise. I cannot say if uint32_t Live : 1; uint32_t Hash : 29; uint32_t OutputOffHi : 2; uint32_t OutputOffLo = 0; and uint32_t Live : 1; uint32_t Hash : 23; uint32_t OutputOffHi : 8; uint32_t OutputOffLo = 0; make it faster or slower.

However, I just noticed that the split will be unsafe and that can't be fixed by reordering if conditions.

if (!Sec->Pieces[I].Live) // unsafe to read in another thread
  continue;
size_t ShardId = getShardId(Sec->Pieces[I].Hash); // unsafe to read in another thread
if ((ShardId & (Concurrency - 1)) == ThreadId)
  Sec->Pieces[I].setOutputOff(Shards[ShardId].add(Sec->getData(I))); // OutputOffHi is being written