This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Place SectionPiece::{Live,Hash} bit fields together
ClosedPublic

Authored by MaskRay on Apr 16 2019, 2:15 AM.

Details

Summary

We read Live and write OutputOff concurrently in 3 places.
Separating them avoids sharing a memory location, and thus data races like D41884/PR35788.
This patch places Live and Hash together instead.

Two reasons make this appealing:

  1. Hash is immutable, and Live is almost read-only: it is written only once, in MarkLive.cpp, where Hash is not accessed.
  2. We already discard low bits of Hash to decide the ShardID, so discarding one more bit does not matter much.

Because all the use sites of OutputOff expect uint64_t/size_t, change its
type from int64_t to uint64_t.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

MaskRay created this revision.Apr 16 2019, 2:15 AM
ruiu added a comment.Apr 16 2019, 2:37 AM

This change seems fine as long as it doesn't cause any performance regression. Did you run lld with this change to see whether the increased hash-collision rate has a negative impact?

I'm trying to add another field to the struct which is likely to conflict with this change.

So, we have 128 bits for this struct. 32 bits are used for the input offset. I want to pack the following into the remaining 96 bits.

  • 32 bit hash value
  • 1 bit for liveness
  • 40 bits for the output offset (should be large enough)
  • 5 bits for tail-merge shard ID

The sum is 78 bits, so we have enough bits. The problem is that accessing these bit fields is not thread-safe. So, how about defining the fields as follows

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

and then provide a few accessor functions, namely getOutputOff and setOutputOff, to this struct? Then accesses to this class become thread-safe.

Did you run lld with this change to see whether the increased hash-collision rate has a negative impact?

This patch actually shows an improvement, likely due to line 2938: if (!Sec->Pieces[I].Live) return;

Before: 19.871 +- 0.149 seconds time elapsed ( +- 0.75% )
After: 19.207 +- 0.118 seconds time elapsed ( +- 0.61% )

I'm trying to add another field to the struct which is likely to conflict with this change.

Do you mean D38528? I haven't read that. I'll check that and think about your comments carefully tomorrow.

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;
uint32_t OutputOffLo;

Do you mean

// 16 bytes -> 12 bytes
uint32_t InputOff;

uint32_t Live : 1;
uint32_t Hash : 18;
uint32_t TailShardId : 5;
uint32_t OutputOffHi : 8;

uint32_t OutputOffLo;

Looks fine.

ruiu added a comment.Apr 17 2019, 12:38 AM

Reducing the size to 12 bytes looks like a good idea, but I'm worried that an 18-bit hash may be too short. This hash value is used to determine a bucket in a hash table, so reducing the number of hash bits means we'll likely have more collisions.

MaskRay added a comment.EditedApr 17 2019, 12:51 AM

That was my interpretation of your previous comment; I thought you wanted to do that (now I realize I may have misunderstood what you were getting at):

40 bits for the output offset (should be large enough)

I am happy with the current 16-byte SectionPiece. Making OutputOffHi share a unit with either Hash or Live brings back the data-sharing caveats.

ruiu added a comment.Apr 17 2019, 1:00 AM

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, because OutputOffHi is now a plain (non-bit-field) member with its own memory location.

The point of my last comment was that, with the new layout, accesses to OutputOffHi don't race with accesses to Live or Hash, because OutputOffHi is now a plain (non-bit-field) member with its own memory location.

I don't understand the split of uint64_t OutputOff -> uint8_t OutputOffHi+uint32_t OutputOffLo (40 bits).

uint32_t InputOff;

uint32_t Hash;
uint8_t Live : 1;
uint8_t TailShardId : 5;
uint8_t OutputOffHi;  // this may be written concurrently with the read of `Live` -> race

uint32_t OutputOffLo;
ruiu added a comment.Apr 17 2019, 1:44 AM


Does that really race? OutputOffHi and Live are in different uint8_t in the above struct, so a write to OutputOffHi doesn't race with a read of Live, no?

Does that really race? OutputOffHi and Live are in different uint8_t in the above struct, so a write to OutputOffHi doesn't race with a read of Live, no?

It doesn't race. Since OutputOffHi is not a bit field, it occupies its own memory location, and concurrent accesses to different memory locations don't race. I still don't understand the split of OutputOff, though: if you pack other members into the same unit as OutputOffHi and read them, they may race with the write of OutputOffHi.

ruiu added a comment.Apr 17 2019, 10:32 PM

I don't know if I understand what you said correctly, so let me write down my understanding.

  • The main purpose of this patch is to make concurrent accesses to OutputOff and the other members safe.
  • The OutputOff member currently races with the Live member because they are bit fields of the same memory location.
  • In order to make OutputOff race-free, we should make it a non-bit-field member.

Are we still on the same page? If so, please look at the following class.

class SectionPiece {
public:
  SectionPiece(size_t Off, uint32_t Hash, bool Live)
      : InputOff(Off), Hash(Hash >> 1), Live(Live || !Config->GcSections) {}

  void setOutputOff(uint64_t Val) {
    OutputOffHi = Val >> 32;
    OutputOffLo = Val;
  }

  uint64_t getOutputOff() const {
    return (uint64_t(OutputOffHi) << 32) | OutputOffLo;
  }

  uint32_t InputOff;
  uint32_t Hash;
  uint8_t TailHash;
  uint8_t Live;

private:
  uint8_t OutputOffHi;
  uint32_t OutputOffLo;
};

Notice that {get,set}OutputOff don't race with accesses to InputOff/Hash/TailHash/Live, so the goal has been achieved. Also, with this scheme we don't sacrifice any bits of Hash: the Hash member still has the full 32 bits.

MaskRay added a comment.EditedApr 18 2019, 12:18 AM

I get your idea now: splitting OutputOff to make space for Hash (to keep it at 32 bits). However, my benchmark shows a 31-bit Hash works just fine.

For a huge internal executable (a 1.6GiB clang built with -O3), Strings in StringTableBuilder::finalizeStringTable contains at most 310253 elements.
Every pair has a probability of 2^(-31) of colliding, so the expected number of pairwise collisions is 2^(-31) * C(310253,2) ~= 22.41. Note that this count is pairwise: if 5 elements hash to the same value, they count as C(5,2) collisions. Assuming every bucket but one has at most 1 element, the bucket with collisions has at most 7 elements, so the degraded performance is nearly nothing.

So for simplicity, I prefer leaving OutputOff as is.

ruiu accepted this revision.Apr 18 2019, 12:24 AM

LGTM

Fair. Thank you for the numbers.

This revision is now accepted and ready to land.Apr 18 2019, 12:24 AM
This revision was automatically updated to reflect the committed changes.
ruiu added a comment.Apr 18 2019, 12:53 AM

By the way, I think your finding that this struct can be 12 bytes instead of 16 is pretty interesting. We previously found that making this struct as small as possible matters for performance, perhaps due to memory locality, so shaving 4 bytes off might make a noticeable difference in speed. Do you want to try? (If not, I'll do it sometime in the future.)

By the way, I think your finding that this struct can be 12 bytes instead of 16 is pretty interesting. We previously found that making this struct as small as possible matters for performance, perhaps due to memory locality, so shaving 4 bytes off might make a noticeable difference in speed. Do you want to try? (If not, I'll do it sometime in the future.)

I tested it: https://reviews.llvm.org/differential/diff/195689/ Making SectionPiece 12 bytes has some negative impact on performance (likely due to the extra OutputOff packing/unpacking instructions), so it isn't worth doing.

Before: 17.177 +- 0.133 seconds time elapsed ( +- 0.77% )
After: 17.279 +- 0.171 seconds time elapsed ( +- 0.99% )

ruiu added a comment.Apr 18 2019, 2:15 AM

Thank you for measuring!

The numbers are interesting. That's somewhat counter-intuitive; perhaps the reduced hash size did a bad thing? I'd think 34 bits should be enough for OutputOff, though.

MaskRay added a comment.EditedApr 18 2019, 2:56 AM

That's somewhat counter-intuitive, but perhaps a reduced hash size did a bad thing? I'd think that 34 bits should be enough for OutputOff though.

The class has false sharing problems.

My perf stat results varied, and I think the difference was noise. I cannot say whether uint32_t Live : 1; uint32_t Hash : 29; uint32_t OutputOffHi : 2; uint32_t OutputOffLo = 0; or uint32_t Live : 1; uint32_t Hash : 23; uint32_t OutputOffHi : 8; uint32_t OutputOffLo = 0; makes it faster or slower.

However, I just noticed that the split would be unsafe, and that can't be fixed by reordering the if conditions:

if (!Sec->Pieces[I].Live) // unsafe to read in another thread
  continue;
size_t ShardId = getShardId(Sec->Pieces[I].Hash); // unsafe to read in another thread
if ((ShardId & (Concurrency - 1)) == ThreadId)
  Sec->Pieces[I].setOutputOff(Shards[ShardId].add(Sec->getData(I))); // OutputOffHi is being written