
[ELF] - Speedup MergeInputSection::splitStrings
Needs ReviewPublic

Authored by grimar on Apr 12 2018, 8:12 AM.

Details

Reviewers
ruiu
espindola
Summary

This is for https://bugs.llvm.org//show_bug.cgi?id=37029,
which suggested experimenting with using hash_value for splitting strings.

For short strings, hash_value at some point falls back to hash_short:
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/ADT/Hashing.h#L453

I think we can use it instead of xxHash64 for `splitStrings`, as this method
returns a uint64_t and seems to show really good results.
It computes the hash of only a part of the string, but that seems to be OK here.

I used Scylla to profile the changes and benchmarked a few algorithms.
splitNonStrings did not show up in the profile,
so I experimented only with changing splitStrings. Results are below:

* Default (xxHash64):                          CPU(%)   CPU(ms)
- lld.exe                                      100.00%  4254
 + lld::elf::MergeInputSection::splitStrings    21.86%   930

* With use of hash_value:
- lld.exe                                      100.00%  4001
 + lld::elf::MergeInputSection::splitStrings    18.25%   730

* With use of hashGnu:
- lld.exe                                      100.00%  4469
 + lld::elf::MergeInputSection::splitStrings    25.60%  1144

* With use of hashSysV:
- lld.exe (PID: 5716)                          100.00%  5080
 + lld::elf::MergeInputSection::splitStrings    33.40%  1711

* This patch:
- lld.exe (PID: 9192)                          100.00%  3866
 + lld::elf::MergeInputSection::splitStrings    13.24%   512

So this change improves total CPU time by about 10% (4254 ms → 3866 ms) for Scylla,
and makes splitStrings about 80% faster (930 ms → 512 ms).
(Note: these are the times reported by the profiler; I have not yet benchmarked it
in a regular way.)

The seed value was taken from:
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/ADT/Hashing.h#L328
It is equal to the default seed used by hash_value.

Diff Detail

Event Timeline

grimar created this revision.Apr 12 2018, 8:12 AM
grimar edited the summary of this revision. (Show Details)Apr 12 2018, 8:12 AM

Interesting. The patch by itself seems fine. I will benchmark it locally.

espindola added inline comments.Apr 12 2018, 11:21 AM
ELF/InputSection.cpp
898

For consistency we should probably also replace this use of xxHash64.

grimar added inline comments.Apr 12 2018, 12:14 PM
ELF/InputSection.cpp
898

I just did not find a good test for that place. I tried that change, but for Scylla there was no difference, so I decided not to do it in this patch, to keep the focus on the main place.

I just noticed that hash_short will read at most 64 bytes of the string.

This could cause a pathological case if many symbols have a common prefix, no?

One interesting thing about the current setup is that we first read the bytes in the string looking for the 0 that terminates the string. We then read them again to hash them. At least with a simple hash like djb (what is implemented in hashGnu) it should not be too hard to read each byte once. Would you mind giving that a try?
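For what it's worth, the single-pass idea can be sketched standalone (using std::string_view instead of LLVM's StringRef; the function name is illustrative, not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Compute the djb hash of a null-terminated string while scanning for
// its terminator, so each byte is read exactly once. Returns the size
// including the null byte and writes the hash to the out-parameter.
static size_t findNullAndHash(std::string_view S, uint32_t &HashOut) {
  uint32_t Hash = 5381; // djb initial value
  const char *Data = S.data();
  while (uint8_t C = *Data++)
    Hash = (Hash << 5) + Hash + C; // Hash * 33 + C
  HashOut = Hash;
  return Data - S.data(); // includes the terminating null byte
}
```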

ELF/InputSection.cpp
898

I don't think there is ever a case where a non-SHF_STRINGS SHF_MERGE section shows up in the profile.

ruiu added inline comments.Apr 12 2018, 1:45 PM
ELF/InputSection.cpp
867–868

This is interesting, but if this is effective, we should do that in StringRef::hash so that it speeds up everyone's code, shouldn't we?

ruiu added a comment.Apr 12 2018, 1:46 PM

What is your operating system and CPU?

I just noticed that hash_short will read at most 64 bytes of the string.

This could cause a pathological case if many symbols have a common prefix, no?

Sure, we will have many hash collisions then, but how likely is that in practice?

I see at least two good (IMO) solutions for that:

  1. Use hash_short for short strings only (< 64 bytes) and fall back to something else otherwise,
     though I would first give the current way a chance.

  2. Alternatively, pass 64 bytes from the middle of the string, assuming they cover the prefix,
     the data itself, and the suffix.
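Option 2 above could look roughly like this (a standalone sketch with std::string_view; the 64-byte window size and the centering policy are assumptions, not part of any actual patch):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Pick a window of at most Max bytes from the middle of the string to
// hash, so strings sharing a long common prefix or suffix still tend
// to hash differently.
static std::string_view middleWindow(std::string_view S, size_t Max = 64) {
  if (S.size() <= Max)
    return S;
  return S.substr((S.size() - Max) / 2, Max); // center the window
}
```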

One interesting thing about the current setup is that we first read the bytes in the string looking for the 0 that terminates the string. We then read them again to hash them. At least with a simple hash like djb (what is implemented in hashGnu) it should not be too hard to read each byte once. Would you mind giving that a try?

Sure, I'll try.

What is your operating system and CPU?

Windows 8, i7-4970K @ 4.00 GHz, 32 GB RAM.

One interesting thing about the current setup is that we first read the bytes in the string looking for the 0 that terminates the string. We then read them again to hash them. At least with a simple hash like djb (what is implemented in hashGnu) it should not be too hard to read each byte once. Would you mind giving that a try?

Not much difference for me:

Function Name                                    Total CPU(%)  Total CPU(ms)

* After the change:
- lld.exe (PID: 15032)                           100.00        4166
 + lld::elf::MergeInputSection::splitIntoPieces   22.40         933

* Default (xxHash64):
- lld.exe                                        100.00        4254
 + lld::elf::MergeInputSection::splitStrings      21.86         930

My test code was:

static size_t findNull(StringRef S, size_t EntSize, uint32_t &H) {
  uint32_t Hash = 5381;

  // Optimize the common case. The hash is computed in the same pass
  // that scans for the null terminator, so each byte is read once.
  if (EntSize == 1) {
    const char *Data = S.data();
    while (uint8_t C = *Data++)
      Hash = (Hash << 5) + Hash + C;
    H = Hash;
    return Data - S.data();
  }

  // Scylla's inputs only contain EntSize == 1 sections, so the generic
  // path below is never reached in this experiment.
  llvm_unreachable("scylla?");

  for (unsigned I = 0, N = S.size(); I != N; I += EntSize) {
    const char *B = S.begin() + I;
    if (std::all_of(B, B + EntSize, [](char C) { return C == 0; }))
      return I;
  }
}


// Split a SHF_STRINGS section. Such a section is a sequence of
// null-terminated strings.
void MergeInputSection::splitStrings(ArrayRef<uint8_t> Data, size_t EntSize) {
  size_t Off = 0;
  bool IsAlloc = Flags & SHF_ALLOC;
  StringRef S = toStringRef(Data);

  while (!S.empty()) {
    uint32_t Hash;
    size_t Size = findNull(S, EntSize, Hash);

    Pieces.emplace_back(Off, Hash, !IsAlloc);
    S = S.substr(Size);
    Off += Size;
  }
}
grimar updated this revision to Diff 142377.Apr 13 2018, 4:25 AM
  • Removed the extraneous change.
ELF/InputSection.cpp
867–868

Hmm. I thought this change had a minor but positive effect.
But today I re-profiled exactly this place using more iterations, and it seems it was a measurement error.
So I removed this part.

OK,

Not much difference for me:

Function Name                                    Total CPU(%)  Total CPU(ms)

* After the change:
- lld.exe (PID: 15032)                           100.00        4166
 + lld::elf::MergeInputSection::splitIntoPieces   22.40         933

* Default (xxHash64):
- lld.exe                                        100.00        4254
 + lld::elf::MergeInputSection::splitStrings      21.86         930

In the description you report that just using hashGnu is 4469 ms, so some reasonable hypotheses so far:

  • Reading the value only once is a good improvement.
  • Reading a byte at a time is a noticeable loss.

So it would probably be ideal to combine some of the memchr tricks for reading multiple bytes at a time with a simple-ish hash that can combine more than one char at a time.

It should also be possible to template Hash.h over the returned type so that some clients can explicitly request a 32 or 64 bit hash. Not sure if that change would be accepted.

It should also be possible to template Hash.h over the returned type so that some clients can explicitly request a 32 or 64 bit hash. Not sure if that change would be accepted.

Do you know why it has to use size_t, btw? Given that hash_value falls back to hash_short, which returns uint64_t, I wonder whether that was an intentional design decision or whether it could be changed to always use uint64_t.

I will also be happy to experiment with your suggestions. I will only be able to start a bit later, though (I am off during the next week because of llvm-euro).

It should also be possible to template Hash.h over the returned type so that some clients can explicitly request a 32 or 64 bit hash. Not sure if that change would be accepted.

Do you know why it has to use size_t, btw? Given that hash_value falls back to hash_short, which returns uint64_t, I wonder whether that was an intentional design decision or whether it could be changed to always use uint64_t.

I think part of the reason is the desire to make the results unstable. The hashing API can return different results in different machines. If there is a hash algorithm that works better with AVX, that algorithm can be used if the host has AVX.

What we need is some form of signature/digest. We can change it as lld evolves, but it has to produce the same result on every host.

ruiu added a comment.Apr 18 2018, 1:18 PM

I believe a fast strlen() can be implemented using SSE instructions. And if you are using SSE instructions, your data is loaded to XMM registers. I believe there exists a fast vectorized hash function that works on data on a XMM register. I wonder if we can combine the two to create a single fast function that returns the length of a string as well as its hash value.

I believe a fast strlen() can be implemented using SSE instructions. And if you are using SSE instructions, your data is loaded to XMM registers. I believe there exists a fast vectorized hash function that works on data on a XMM register. I wonder if we can combine the two to create a single fast function that returns the length of a string as well as its hash value.

Probably, but I would suggest not going that far in the first patch. Also note that we can use memchr, which is a bit easier than strlen. The hash_value in llvm uses 64 bits at a time. Given that the byte-at-a-time djb hash was already a small speedup, using 64 bits at a time for a combined memchr and hash should be very helpful.
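The word-at-a-time scan relies on the classic SWAR trick for detecting a zero byte inside a word; a minimal standalone sketch of just that check (the constant pattern is standard, the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Returns true iff at least one of the 8 bytes in Word is zero.
// Subtracting 1 from each byte sets a byte's high bit when the byte
// was 0x00 (via borrow); ANDing with ~Word discards bytes whose high
// bit was already set. The result is nonzero iff Word has a zero byte.
static bool hasZeroByte(uint64_t Word) {
  return ((Word - 0x0101010101010101ULL) & ~Word &
          0x8080808080808080ULL) != 0;
}
```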

grimar planned changes to this revision.Apr 23 2018, 8:44 AM
grimar updated this revision to Diff 143758.EditedApr 24 2018, 9:07 AM
  • Reimplemented. With this implementation, timings for Scylla are:

This patch:
- lld.exe                                      100.00%  3944
 + lld::elf::MergeInputSection::splitStrings    10.75%   424

LLD head:
- lld.exe                                      100.00%  4234
 + lld::elf::MergeInputSection::splitStrings    21.82%   924
espindola added inline comments.Apr 24 2018, 11:07 AM
ELF/InputSection.cpp
887

This will produce different results on a big endian host, no?

895

I don't know enough about hashing to judge if this is a reasonable extension of the djb hash for using 4 bytes at a time, but we can probably start with it.

grimar updated this revision to Diff 143910.Apr 25 2018, 5:32 AM
  • Addressed comments.
ELF/InputSection.cpp
887

You are right.

To solve this,
I tried rewriting the hashing code right below to read byte by byte in a loop, but that hurt performance too much.

Then I tried both read32be() and read32le() (my host is LE). The average time for them was the same, with no difference from *reinterpret_cast<const uint32_t *>, so it seems we can use it.
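For reference, a portable read32le can also be written without reinterpret_cast; assembling the value byte by byte is endianness- and alignment-safe, and compilers typically fold it into a single load on little-endian targets (a standalone sketch, not the actual llvm/Support/Endian.h implementation):

```cpp
#include <cassert>
#include <cstdint>

// Assemble a 32-bit little-endian value from four bytes. Correct on
// any host endianness and for unaligned pointers.
static inline uint32_t read32le(const unsigned char *P) {
  return (uint32_t)P[0] | ((uint32_t)P[1] << 8) |
         ((uint32_t)P[2] << 16) | ((uint32_t)P[3] << 24);
}
```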

895

I experimented here; any bad hashing increases hash collisions, which instantly shows up in the profile. My approach seems to work well, so I think it is OK. It is simpler than taking single bytes, and the byte-reading loop was also slower for me.

grimar planned changes to this revision.Apr 25 2018, 10:04 AM
grimar updated this revision to Diff 143953.Apr 25 2018, 10:12 AM
  • Fixed last minute issue (forgot to rename read32le to generic read32).
espindola added inline comments.Apr 26 2018, 7:19 PM
ELF/InputSection.cpp
878

I think I just realized a problem with this. If you have a string that is more than 4 bytes long, its hash value depends on its address, no? For example, if the string foobarzed starts at a 4-byte-aligned position, we will hash it as

foob arze d\0

but if it is offset by one byte, it will be hashed as

f ooba rzed \0

You can probably just delete the initial alignment loop. This will produce slow code on CPUs that don't support fast unaligned loads, but we are already very slow in those cases.
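The fix is easy to check: if chunk boundaries are measured from the start of the string rather than from aligned addresses, the hash no longer depends on the string's position in the buffer. A standalone sketch (djb-style combining; the chunk values depend on host endianness, which is a separate concern from address independence):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hash a string in 4-byte chunks measured from the string's own start
// (not from aligned addresses), so the result does not change when
// the string sits at a different offset in the section buffer.
static uint32_t hashChunked(std::string_view S) {
  uint32_t H = 5381;
  size_t I = 0;
  for (; I + 4 <= S.size(); I += 4) {
    uint32_t W;
    std::memcpy(&W, S.data() + I, 4); // safe unaligned load
    H = H * 33 + W;
  }
  for (; I < S.size(); ++I) // leftover tail bytes
    H = H * 33 + (uint8_t)S[I];
  return H;
}
```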

grimar added inline comments.Apr 27 2018, 6:25 AM
ELF/InputSection.cpp
878

Oh, nice catch!

I removed the loop and, interestingly, it sped up the timings for me even more!
Time went from ~3968 to ~3704 (total LLD CPU time), and from ~443 to ~415 for splitStrings.
It seems this change probably produces a better hash. I'll update the patch.

grimar updated this revision to Diff 144326.Apr 27 2018, 6:27 AM
  • Addressed review comment.

I think this is fine. I will run the benchmarks locally to confirm.

ELF/InputSection.cpp
893

It should be OK to return a std::pair, no?

You have to update

lld :: ELF/compressed-debug-input.s                 
lld :: ELF/relocatable-compressed-input.s

I guess you don't have zlib installed.

You have to update

lld :: ELF/compressed-debug-input.s                 
lld :: ELF/relocatable-compressed-input.s

I guess you don't have zlib installed.

I'll check, thanks. I believe zlib is still generally not enabled/supported in the Windows LLVM configuration.

ELF/InputSection.cpp
893

I tried, but it hurt the performance for me. I can probably add a Hash reference out-parameter instead.

ruiu added a comment.Apr 27 2018, 2:03 PM

For some reason I missed this thread. I'll review this, but it looks pretty promising!

ruiu added inline comments.Apr 27 2018, 2:13 PM
ELF/InputSection.cpp
872

Split this entire function in two: one for '\0'-terminated strings and the other for multi-word strings. That's more readable than doing it inside the while loop.

873

I'd name this H.

876

sizeof(uint32_t) is always 4, so please write just 4.

881

Omit != 0 because it is always implied.

884

I don't think you need to update DataSize. You know the end position of S, so you can compare Data with it to see if you've reached the end of a string.

885

* 33 is perhaps better for brevity.

885–887

This is very odd, and we should avoid returning two values in one word. Since findSizeAndHash is inlined, I don't think you need this.

890

Likewise, maintaining both DataSize and Data doesn't seem to make sense to me.

897

You should be able to handle this case gracefully.

The test results are interesting.

The geometric mean of seconds-elapsed improves by 0.5%. Scylla is just 0.3% faster.

Part of the reason is a 12% regression in the number of instructions for Scylla. Maybe that is because xxHash64 hashes 8 bytes at a time? Have you experimented with reading 8 bytes at a time?

ruiu added a comment.Apr 27 2018, 3:03 PM

Rafael,

If you have time, can you plug in my change and compare it in your environment? I didn't see any noticeable difference on my machine, but there might be one in other environments.

The test results are interesting.

The geometric mean of seconds-elapsed improves by 0.5%. Scylla is just 0.3% faster.

Strange, because I am observing a stable, major boost. Retested today just in case.
Total CPU time, 3 runs, MSVS profiler, RelWithDebInfo configuration, linking \lld-speed-test\scylla.
Windows 8, i7-4970K @ 4.00 GHz, 32 GB RAM.

Numbers:

  1. LLD (r331097): 4057ms, 4076ms, 4028ms
  2. This patch: 3537ms, 3511ms, 3531ms.

Part of the reason is a 12% regression in the number of instructions for Scylla. Maybe that is because xxHash64 hashes 8 bytes at a time? Have you experimented with reading 8 bytes at a time?

It was slower for me, but I think it is worth retesting. I am going to address the rest of the comments
so that it is easy to switch between reading 4 and 8 bytes at a time, and then I will retest the difference between the two.

grimar marked 8 inline comments as done.Apr 29 2018, 6:25 PM

Update:
Time of D46163 for me was: 3873, 3916, 3937.

grimar updated this revision to Diff 144499.Apr 29 2018, 6:31 PM
  • Addressed review comments.
grimar added a comment.EditedApr 29 2018, 6:35 PM

I tried reading both 4 and 8 bytes at once. The code for 8 bytes was:

static inline size_t findSizeAndHash(StringRef S, uint64_t &Hash) {
  const char *Data = S.data();
  const char *const End = Data + S.size();

  // We calculate a simple hash based on the DJB hash below. The hash
  // is computed at the same time as we read the string bytes, for speed.
  uint64_t H = 5381;

  // Load a word at a time and test if any of bytes is 0-byte.
  while (End - Data > 8) {
    uint64_t Word = read64(Data);
    // This checks whether at least one byte of the word is a null byte.
    // If so, break out of the loop and fall through to the byte-by-byte
    // scan below to find the exact null byte position.
    if ((Word - 0x0101010101010101) & ~Word & 0x8080808080808080)
      break;
    Data += 8;
    H = H * 33 + Word;
  }

  // Now find the exact position of the null byte. Do not forget to
  // update the hash value too.
  while (End - Data) {
    H = H * 33 + *Data;
    if (!*Data) {
      Hash = H;
      return Data - S.data() + 1;
    }
    ++Data;
  }

  llvm_unreachable("string is not null terminated");
}

It generally seems to run about 50 ms slower than 4 bytes at a time (the posted diff) for me, though
the difference is sometimes so minor that I am inclined to think it may just be measurement noise.
Do we want it?