This is an archive of the discontinued LLVM Phabricator instance.

Bitcode: use blob for string storage in the IR: trade a bit of space for faster reading
AbandonedPublic

Authored by mehdi_amini on Jan 17 2016, 1:55 PM.

Details

Summary

The bitcode stores strings in a compressed fashion. This is nice for space, but
it is expensive to encode/decode. I experimented with using the "blob" type in the
bitcode and obtained a non-negligible speedup.
The following patch provides a 10% speedup on the execution of getLazyIRModule()
when linking "opt" with ThinLTO. The on-disk size is also increased by ~10%.

I left it under a flag in the writer for experimentation purposes, but I'm seeking
opinions on the subject.

Diff Detail

Event Timeline

mehdi_amini retitled this revision from to Bitcode: use blob for string storage in the IR: trade a bit of space for faster reading.
mehdi_amini updated this object.
mehdi_amini added a subscriber: llvm-commits.
filcab added a subscriber: filcab. Jan 18 2016, 8:22 AM
filcab added inline comments.
include/llvm/Support/StreamingMemoryObject.h
34

What happens if there are two BLOBs in the stream? Wouldn't you overwrite one with the other?

mehdi_amini added inline comments. Jan 18 2016, 10:59 AM
include/llvm/Support/StreamingMemoryObject.h
34

Yes, the pointer returned is only valid until the next read from the stream.
The model is that the client will make a copy anyway. But with a blob we won't "unpack" 6-bit elements into an array of unsigned, then decode the 6-bit encoding into chars, and then do the copy.
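(For illustration, a rough sketch of the decoding step a blob skips. The value mapping mirrors the bitstream Char6 abbreviation encoding, but the helper names and the copy loop below are only illustrative, not the actual reader code.)

#include <cstdint>
#include <string>
#include <vector>

// Char6 maps 0-25 -> 'a'..'z', 26-51 -> 'A'..'Z', 52-61 -> '0'..'9',
// 62 -> '.', 63 -> '_'.
static char decodeChar6(unsigned V) {
  if (V < 26) return 'a' + V;
  if (V < 52) return 'A' + (V - 26);
  if (V < 62) return '0' + (V - 52);
  return V == 62 ? '.' : '_';
}

// Each element is first unpacked from the bitstream into a 64-bit value,
// then decoded to a char, then copied into the destination string.
static std::string decodeRecordString(const std::vector<uint64_t> &Record) {
  std::string S;
  S.reserve(Record.size());
  for (uint64_t V : Record)
    S.push_back(decodeChar6(static_cast<unsigned>(V)));
  return S;
}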

Note also that I haven't found any place other than llvm-dis that uses this code path.

Have you tried running clang with this patch? If possible, with ASan on.

include/llvm/Support/StreamingMemoryObject.h
34

Have you tried clang too?

Wouldn't this code (in clang) break (I added comments)?

case SM_SLOC_BUFFER_ENTRY: {
  const char *Name = Blob.data();           // <- Getting a ref to the current blob
  unsigned Offset = Record[0];
  SrcMgr::CharacteristicKind
    FileCharacter = (SrcMgr::CharacteristicKind)Record[2];
  SourceLocation IncludeLoc = ReadSourceLocation(*F, Record[1]);
  if (IncludeLoc.isInvalid() &&
      (F->Kind == MK_ImplicitModule || F->Kind == MK_ExplicitModule)) {
    IncludeLoc = getImportLocation(F);
  }
  unsigned Code = SLocEntryCursor.ReadCode();
  Record.clear();
  unsigned RecCode
    = SLocEntryCursor.readRecord(Code, Record, &Blob);     // <- That old blob reference is now invalid

  if (RecCode != SM_SLOC_BUFFER_BLOB) {
    Error("AST record has invalid code");
    return true;
  }

  std::unique_ptr<llvm::MemoryBuffer> Buffer =
      llvm::MemoryBuffer::getMemBuffer(Blob.drop_back(1), Name);    // <- Use ref to first blob
  SourceMgr.createFileID(std::move(Buffer), FileCharacter, ID,
                         BaseOffset + Offset, IncludeLoc);
  break;
}

"Nothing" in "llvm only" uses Blobs, basically (llvm-bcanalyzer does get one and dump it :-) ). But clang uses them a lot.

tejohnson edited edge metadata. Jan 19 2016, 7:32 AM

Was the improvement in speed measured on top of your DecodeChar6 change? If not, do you still see a nice improvement?

include/llvm/Support/StreamingMemoryObject.h
34

The old streaming getPointer triggered a fatal error, so presumably the clang code wasn't invoking it during streaming and wouldn't break with Mehdi's change. Or is the concern that other code like this may creep in for StreamingMemoryObject?

lib/Bitcode/Reader/BitcodeReader.cpp
1732

There's a lot of code duplication here with the other recordValue(). Looks like the other recordValue() could get the Name string and invoke this one. Or just keep this one and if Name is empty here, invoke convertToString.

2675

Looks like these two cases could be collapsed as they were before. Is it clearer with them separated?

lib/Bitcode/Writer/BitcodeWriter.cpp
1514

This should be CST_CODE_STRING. Also, change comment to distinguish from original CST_CODE_STRING case above?

1531

Change comment to distinguish from original CST_CODE_CSTRING case above?

1643

Move up and make other cases "else if"

2300

Needs comment.

2337

Move this up and make the other case "else if"

2360

Move this up and make the other cases "else if"

2619

Needs comment.

filcab added inline comments. Jan 19 2016, 8:43 AM
include/llvm/Support/StreamingMemoryObject.h
34

True, the Streaming version is not being used in clang for sure. And if it's used by someone else (PNaCl?), it's not a problem now, so it can't be a problem after Mehdi's patch.

Mehdi: Instead of keeping the assert like I was saying (which is basically there to block code paths), let's just do your change. But please add some documentation stating that the pointer returned by getPointer is only guaranteed to be valid until the next call to it, and that the caller needs to either use it before then or copy it. That way it's explicitly stated in the comment.
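(A possible sketch of that documentation, assuming the existing getPointer(Address, Size) signature in StreamingMemoryObject.h; the exact wording and declaration may differ.)

/// Return a pointer to the requested bytes of the stream.
///
/// Note: the returned pointer is only guaranteed to remain valid until the
/// next read from the stream (e.g. the next call to getPointer or
/// readBytes). Callers must either consume the bytes before then or make
/// their own copy.
const uint8_t *getPointer(uint64_t Address, uint64_t Size) const override;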

Thank you.

Thanks for the comments; the measurements were made after the decodeChar6 changes.

lib/Bitcode/Reader/BitcodeReader.cpp
1732

I like your second suggestion! Thanks.

2675

The CST_CODE_CSTRING case has to add a 0 at the end of the string. This was hidden in ConstantDataArray::getString at the price of an extra copy.
This is why I split the two, to save the SmallString copy in the non-blob case for STRING.
I just noticed that I still have a SmallString for the STRING case; I think I can remove it.

lib/Bitcode/Writer/BitcodeWriter.cpp
1643

The OptSpeed flag is hacked in a bit everywhere; I'll clean it up, but I wasn't sure about the general feeling on this patch.

In D16277#330142, @joker.eph wrote:

Thanks for the comments; the measurements were made after the decodeChar6 changes.

Do you know where the current hotspot is? Do the strings have such higher overhead than blobs because of the copying required in the BitstreamReader when decoding? I wouldn't think that your new version of decodeChar6 would be very expensive. If it is due to the copying, what about an alternate approach, where the string is returned as is without copying/decoding (like a blob), but the decoding is done in the BitcodeReader or whatever it calls that copies the blob (e.g. ConstantDataArray::getString for constants, and does Value::setName do the copying for VST entries in the blob case?).

That's more invasive, but 10% is a pretty big on-disk size increase.

Another question - do you know the breakdown between the different record types? This should be pretty easily extracted from the llvm-bcanalyzer record histograms. A couple thoughts:

  • If it is dominated by the VST FNENTRY records, then I would imagine that reading in the combined function index file in the ThinLTO compile would also suffer from the same problem. Do we need to do something there too?
  • If it is module-level constant strings and VST ENTRY records (for declarations), I wonder if there is some kind of lazy reading/decoding we could do for these (they presumably aren't all needed by imported functions).
  • If it is function level BBENTRY and ENTRY records (presumably for the imported functions since we wouldn't parse the others), then I wonder if we can get some savings (both time and space) by using a string table. For a simple example I looked at, I had two functions each containing a call to printf that each had a BBENTRY record with the string "entry" and ENTRY record with the string "call". I would imagine the former in particular is duplicated quite frequently. I saw BBENTRY records in another simple function I looked at that had other very common sounding strings such as "return", "if.else", "if.then", "retval", etc. Ditto for ENTRY records for local values/parameters which might have the same name in multiple functions.
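(To illustrate the string-table idea, a hypothetical writer-side helper; none of these names exist in the bitcode writer, and this only sketches the deduplication: a common name like "entry" would be stored once and referenced by index from each record.)

#include <string>
#include <unordered_map>
#include <vector>

class NameTable {
  std::unordered_map<std::string, unsigned> Index;
  std::vector<std::string> Strings;
public:
  // Return a stable index for Name, adding it to the table on first use.
  unsigned intern(const std::string &Name) {
    auto It = Index.find(Name);
    if (It != Index.end())
      return It->second;
    unsigned Id = Strings.size();
    Index.emplace(Name, Id);
    Strings.push_back(Name);
    return Id;
  }
  // The deduplicated strings, emitted once (e.g. as a single blob).
  const std::vector<std::string> &table() const { return Strings; }
};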
In D16277#330142, @joker.eph wrote:

Thanks for the comments; the measurements were made after the decodeChar6 changes.

Do you know where the current hotspot is? Do the strings have such higher overhead than blobs because of the copying required in the BitstreamReader when decoding? I wouldn't think that your new version of decodeChar6 would be very expensive. If it is due to the copying, what about an alternate approach, where the string is returned as is without copying/decoding (like a blob), but the decoding is done in the BitcodeReader or whatever it calls that copies the blob (e.g. ConstantDataArray::getString for constants, and does Value::setName do the copying for VST entries in the blob case?).

The problem is that the bitcode is not byte aligned. Are you suggesting that we store the strings as "blob", correctly aligned, but potentially "char6 encoded"?

Another question - do you know the breakdown between the different record types? This should be pretty easily extracted from the llvm-bcanalyzer record histograms. A couple thoughts:

Haven't looked, I have shifted to debug info in the meantime ;)

  • If it is dominated by the VST FNENTRY records, then I would imagine that reading in the combined function index file in the ThinLTO compile would also suffer from the same problem. Do we need to do something there too?

In my use case, the combined function index is only built in memory and never stored on disk.

  • If it is module-level constant strings and VST ENTRY records (for declarations), I wonder if there is some kind of lazy reading/decoding we could do for these (they presumably aren't all needed by imported functions).

Yes, we're not lazy enough in many places, but it is not trivial to implement. :(

  • If it is function level BBENTRY and ENTRY records (presumably for the imported functions since we wouldn't parse the others), then I wonder if we can get some savings (both time and space) by using a string table. For a simple example I looked at, I had two functions each containing a call to printf that each had a BBENTRY record with the string "entry" and ENTRY record with the string "call". I would imagine the former in particular is duplicated quite frequently. I saw BBENTRY records in another simple function I looked at that had other very common sounding strings such as "return", "if.else", "if.then", "retval", etc. Ditto for ENTRY records for local values/parameters which might have the same name in multiple functions.

Bitcode is lazily loaded, so the function bodies were not parsed. I suspect the same problem would show up anywhere: reading elements one by one out of the bitcode stream is far too expensive, since the BitstreamCursor has to keep track of the bit-shifting state and so on.
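(A deliberately naive illustration of that bookkeeping, not the real BitstreamCursor: every record element pays for per-bit position tracking, whereas a byte-aligned blob can be handed out as a single pointer and length.)

#include <cstdint>

struct ToyBitCursor {
  const uint8_t *Buf;   // underlying byte stream
  uint64_t BitPos = 0;  // absolute bit offset into the stream

  // Read NumBits, least-significant bit first within each byte.
  uint64_t read(unsigned NumBits) {
    uint64_t Result = 0;
    for (unsigned I = 0; I != NumBits; ++I, ++BitPos) {
      uint64_t Bit = (Buf[BitPos / 8] >> (BitPos % 8)) & 1;
      Result |= Bit << I;
    }
    return Result;
  }
};

// Decoding a char6-encoded string of N characters costs N dependent 6-bit
// reads; a byte-aligned blob would be a single memcpy (or no copy at all).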

I experimented with storing all the global metadata in a single blob record serialized with a FlatBuffer (https://github.com/google/flatbuffers): my ThinLTO importing gets ~10% faster.

Note: I leave this patch here as a straw man, but I don't actually plan to land it as is. I think the way forward is to completely break down the bitcode encoding, decoupling serialization from compression.

mehdi_amini abandoned this revision.Mar 31 2016, 9:08 AM

Obsolete per r264551

My comment was wrong: this is not obsolete per r264551; this patch covers every non-metadata string, while r264551 is about MDString.