This is an archive of the discontinued LLVM Phabricator instance.

[libObject/Decompressor] - Use `resize_for_overwrite` in Decompressor::resizeAndDecompress().
Accepted · Public

Authored by grimar on Dec 23 2020, 7:01 AM.

Details

Summary

I've been debugging code that tries to decompress a large
chunk of data. In my case the size of the decompressed data was expected
to be ~4 GB.

I've found that the following code

SmallString<0> Out;
...
Out.resize(DecompressedSize);

takes too long. Instead of `resize`, we can use `resize_for_overwrite` here.
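
For reference, a simplified sketch of what the patched method looks like
(the exact in-tree shape in llvm/include/llvm/Object/Decompressor.h may
differ slightly):

// Sketch: the only functional change is the resize call.
template <class T> Error resizeAndDecompress(T &Out) {
  // Skip value-initialization: on success every byte is overwritten by
  // the decompressor, and on failure the buffer is discarded anyway.
  Out.resize_for_overwrite(DecompressedSize);
  return decompress({Out.data(), (size_t)DecompressedSize});
}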

In the Debug configuration, the Out.resize(0xffffffff) call takes 44.1s,
while Out.resize_for_overwrite(0xffffffff) takes 10.8s for me, i.e. a ~4.08x speedup.

For the Release configuration the numbers are:
resize_for_overwrite takes ~1700 microseconds, while resize takes ~1850000 microseconds,
i.e. resize_for_overwrite is ~1088x faster.

This is very noticeable when decompression does not happen at all, e.g. when the compressed data
is corrupted. In that case the new Decompressor::resizeAndDecompress() finishes much faster.

Diff Detail

Event Timeline

grimar requested review of this revision. Dec 23 2020, 7:01 AM
grimar created this revision.
Herald added a project: Restricted Project. Dec 23 2020, 7:01 AM

Good to know that D93532 added resize_for_overwrite.

For Release configuration the numbers are: resize_for_overwrite takes ~1700 microseconds, resize takes ~1850000 microseconds. I.e. the new version is ~1088x faster.

IMO this may be misleading. The new code appears to be faster because previously memset(0) was called on the buffer, which triggered page faults and committed all the pages immediately.
With the new change that cost is amortized into the decompression itself (see the sketch below). I think we should measure the total time including decompression.

LG once the description is fixed or this information is removed from it.
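
A minimal illustration of the initialization difference being discussed,
assuming LLVM's SmallString and the resize_for_overwrite added in D93532:

#include "llvm/ADT/SmallString.h"
using namespace llvm;

void grow(SmallString<0> &S, size_t N) {
  // resize() value-initializes the new tail -- for char, a memset(0) --
  // so every page of the buffer is written (and therefore faulted in)
  // right here, before any real data arrives.
  S.resize(N);
}

void growForOverwrite(SmallString<0> &S, size_t N) {
  // resize_for_overwrite() default-initializes the tail: no stores are
  // issued, so the kernel commits pages lazily on first write during
  // decompression. The cost moves; it does not fully disappear.
  S.resize_for_overwrite(N);
}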

In my case the compressed data is corrupted, i.e. decompression does not happen at all. I'll update the description.

grimar retitled this revision from [libObject/Decompressor] - Speedup Decompressor::resizeAndDecompress() method. to [libObject/Decompressor] - Use `resize_for_overwrite` in Decompressor::resizeAndDecompress(). Dec 23 2020, 11:34 PM
grimar edited the summary of this revision. (Show Details)
grimar edited the summary of this revision. (Show Details) Dec 23 2020, 11:37 PM

IMO this may be misleading. The new code appears to be faster because previously memset(0) was called on the buffer, which triggered page faults and committed all the pages immediately.
With the new change that cost is amortized into the decompression itself. I think we should measure the total time including decompression.

It would be interesting to see the difference. It would also be nice to measure the performance of decompression alone, to see how much slowdown there is when the pages haven't been allocated eagerly.
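
One possible way to split that measurement (a hypothetical variant of the
benchmark below; it assumes Decompressor::decompress(MutableArrayRef<char>)
can be called directly):

// Time the allocation and the decompression separately so the
// page-fault cost can be attributed to one phase or the other.
auto T1 = std::chrono::high_resolution_clock::now();
SmallString<0> Out;
Out.resize_for_overwrite(Size); // or Out.resize(Size) for the old path
auto T2 = std::chrono::high_resolution_clock::now();
if (Error Err = Decompressor.decompress({Out.data(), Out.size()}))
  return 1;
auto T3 = std::chrono::high_resolution_clock::now();
// (T2 - T1): resize cost; (T3 - T2): decompression plus any lazy faults.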

grimar added a comment. (Edited) Jan 13 2021, 4:11 AM

OK. I've used the following piece of code to construct 1 GiB of random data, compress it, and estimate
the difference between the old and the new code in a Release build.

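// Assumed includes for this fragment: <chrono>, <cstdlib>, <vector>,
// "llvm/Support/Compression.h", "llvm/Object/Decompressor.h".
// Phase 1: generate 1 GiB of random input data.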
auto T1 = std::chrono::high_resolution_clock::now();
constexpr uint64_t Size = 1024 * 1024 * 1024uLL;
std::vector<uint8_t> Data;
Data.reserve(Size);
for (size_t I = 0; I != Size; ++I)
  Data.push_back(rand());

auto T2 = std::chrono::high_resolution_clock::now();
outs()
    << "Create data (microseconds): "
    << std::chrono::duration_cast<std::chrono::microseconds>(T2 - T1).count()
    << "\n";

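// Phase 2: compress the random (hence mostly incompressible) data with zlib.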
SmallVector<char, 1> CompressedData;
T1 = std::chrono::high_resolution_clock::now();
if (Error Err = zlib::compress(
        StringRef(reinterpret_cast<const char *>(Data.data()), Size),
        CompressedData)) {
  return 1;
}
T2 = std::chrono::high_resolution_clock::now();
outs()
    << "Compress data (microseconds): "
    << std::chrono::duration_cast<std::chrono::microseconds>(T2 - T1).count()
    << "\n";
outs() << "Compressed data size: " << CompressedData.size() << "\n";

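// Wrap the compressed bytes and set the expected decompressed size
// directly for the experiment (in-tree the usual entry point is
// Decompressor::create(); the constructor and DecompressedSize are
// presumably made accessible for this test).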
object::Decompressor Decompressor(
    StringRef(CompressedData.data(), CompressedData.size()));
Decompressor.DecompressedSize = Size;

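// Phase 3: time resizeAndDecompress() itself -- the code path this patch changes.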
T1 = std::chrono::high_resolution_clock::now();
SmallString<0> DecompressedData;
if (Error Err = Decompressor.resizeAndDecompress(DecompressedData))
  return 1; // an unchecked llvm::Error asserts on destruction
T2 = std::chrono::high_resolution_clock::now();
outs()
    << "Decompress data (microseconds): "
    << std::chrono::duration_cast<std::chrono::microseconds>(T2 - T1).count()
    << "\n";

Below is the average result from 3 runs:

----------------------------------------
With `resize`:
----------------------------------------

Create data (microseconds): 16963852
Compress data (microseconds): 27999644
Compressed data size: 1074069335
Decompress data (microseconds): 836237

----------------------------------------
With `resize_for_overwrite`:
----------------------------------------

Create data (microseconds): 17025348
Compress data (microseconds): 27997363
Compressed data size: 1074069335
Decompress data (microseconds): 549606

549606 / 836237 ≈ 0.657, i.e. the new code takes about 34% less time for me.

jhenderson accepted this revision. Jan 20 2021, 12:32 AM

Looks fine to me in principle, though I don't use any of the compression/decompression functionality, so I'm not necessarily a fair judge. Maybe give it a bit more time for someone else to comment.

This revision is now accepted and ready to land. Jan 20 2021, 12:32 AM