This is an archive of the discontinued LLVM Phabricator instance.

Make sure BitcodeWriter works with Unicode characters
Abandoned, Public

Authored by loladiro on Nov 9 2014, 1:16 PM.

Details

Reviewers: None

Summary

Previously, when a metadata string contained non-ASCII (e.g. Unicode)
characters, its bytes were placed incorrectly in the Record array. Plain
char is signed on many platforms, so bytes with the high bit set were
sign-extended when widened to the 64-bit Record elements, while the
bitcode writer expected each value to fit in the lowest 8 bits. This
caused an assertion failure later on. The fix is just to cast the
pointer to uint8_t* first, which prevents the sign extension.
This came up in the context of metadata strings, but I did a quick pass
and changed the other instances of this pattern in the file as well.
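
The sign-extension problem described above can be reproduced outside LLVM. The following is a minimal sketch, not the patch itself: it assumes a std::vector<uint64_t> as a stand-in for the writer's Record buffer, and the names RecordVec, appendAsChar, appendAsBytes and fitsInByte are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in for the writer's Record buffer (assumption: a plain vector of
// 64-bit values instead of LLVM's own container).
using RecordVec = std::vector<uint64_t>;

// Pre-fix pattern: bytes are read through a char pointer, so where char is
// signed, a byte such as 0xE2 is sign-extended when widened to 64 bits.
RecordVec appendAsChar(const std::string &S) {
  RecordVec Record;
  const char *Begin = S.data();
  Record.insert(Record.end(), Begin, Begin + S.size());
  return Record;
}

// Post-fix pattern: the same bytes are read through a uint8_t pointer, so
// widening to 64 bits always zero-extends.
RecordVec appendAsBytes(const std::string &S) {
  RecordVec Record;
  const uint8_t *Begin = reinterpret_cast<const uint8_t *>(S.data());
  Record.insert(Record.end(), Begin, Begin + S.size());
  return Record;
}

// Models the kind of check that fails later: does the value fit in 8 bits?
bool fitsInByte(uint64_t V) { return (V & ~uint64_t(0xFF)) == 0; }
```

With the UTF-8 bytes of the snowman character from the test ("\xE2\x98\x83"), appendAsBytes yields 0xE2, 0x98, 0x83, while appendAsChar yields values with all upper bits set on platforms where char is signed.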

Event Timeline

loladiro updated this revision to Diff 15959. Nov 9 2014, 1:16 PM

loladiro retitled this revision to Make sure BitcodeWriter works with Unicode characters.

loladiro updated this object.

loladiro edited the test plan for this revision.

loladiro set the repository for this revision to rL LLVM.

loladiro added a subscriber: Unknown Object (MLST).

I'm a bit confused about why we're widening chars into 64-bit values. Is there a quick explanation for that? (It seems inefficient to put one char in each 64-bit entry in Record, rather than packing 8 of them in there.)

If Record is actually bytes, it has the wrong type, doesn't it - it should be SmallString, or SmallVector<uint8_t>, etc...

As far as I understand, the bitcode supports arbitrarily sized fields defined at runtime, so everything goes through uint64_t.
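
As a rough illustration of that point, here is a toy writer in which every value travels as a uint64_t and the field width is only known at run time. It is a sketch under assumptions, not LLVM's BitstreamWriter: ToyBitWriter and emitFixed are hypothetical names, and the writer reports an out-of-range value by returning false where a real writer might assert.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy bit writer; not LLVM's BitstreamWriter API.
struct ToyBitWriter {
  std::vector<bool> Bits; // one entry per emitted bit, least significant first

  // Emits the low Width bits of Val. Width is chosen at run time and must be
  // in [1, 64]. Returns false without emitting anything if Width is out of
  // range or Val has bits set above Width.
  bool emitFixed(uint64_t Val, unsigned Width) {
    if (Width == 0 || Width > 64)
      return false;
    if (Width < 64 && (Val >> Width) != 0)
      return false;
    for (unsigned I = 0; I != Width; ++I)
      Bits.push_back(((Val >> I) & 1) != 0);
    return true;
  }
};
```

In this model, emitFixed(0xE2, 8) succeeds, whereas the sign-extended value 0xFFFFFFFFFFFFFFE2 is rejected for an 8-bit field. This mirrors the failure described in the summary, assuming the string elements are emitted as fixed 8-bit fields.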

In D6184#6, @loladiro wrote:

As far as I understand, the bitcode supports arbitrarily sized fields defined at runtime, so everything goes through uint64_t.

Seems a bit strange but certainly getting out of my depth - thanks for the explanation :)

loladiro abandoned this revision. Dec 27 2014, 4:58 AM

Revision Contents

Path	Size
lib/Bitcode/Writer/BitcodeWriter.cpp	8 lines
test/Bitcode/unicode.ll	11 lines

Diff 15959

lib/Bitcode/Writer/BitcodeWriter.cpp

Context not available.
   StringRef Val = Attr.getValueAsString();

   Record.push_back(Val.empty() ? 3 : 4);
-  Record.append(Kind.begin(), Kind.end());
+  Record.append((uint8_t*)Kind.begin(), (uint8_t*)Kind.end());
   Record.push_back(0);
   if (!Val.empty()) {
-    Record.append(Val.begin(), Val.end());
+    Record.append((uint8_t*)Val.begin(), (uint8_t*)Val.end());
     Record.push_back(0);
   }
 }
Context not available.
 }

 // Code: [strchar x N]
-Record.append(MDS->begin(), MDS->end());
+Record.append((uint8_t *)MDS->begin(), (uint8_t *)MDS->end());

 // Emit the finished record.
 Stream.EmitRecord(bitc::METADATA_STRING, Record, MDSAbbrev);
Context not available.
 for (unsigned MDKindID = 0, e = Names.size(); MDKindID != e; ++MDKindID) {
   Record.push_back(MDKindID);
   StringRef KName = Names[MDKindID];
-  Record.append(KName.begin(), KName.end());
+  Record.append((uint8_t*)KName.begin(), (uint8_t*)KName.end());

   Stream.EmitRecord(bitc::METADATA_KIND, Record, 0);
   Record.clear();
Context not available.

test/Bitcode/unicode.ll

This file was added.

; RUN: llvm-as < %s | llvm-dis | FileCheck %s

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!5}

!0 = metadata !{metadata !"0x11\0012\00clang version ☃\001\00\000\00\000", metadata !4, metadata !2, metadata !2, metadata !2, metadata !2, null} ; [ DW_TAG_compile_unit ]
; CHECK: "0x11\0012\00clang version \E2\98\83\001\00\000\00\000"
!2 = metadata !{}
!3 = metadata !{metadata !"0x29", metadata !4} ; [ DW_TAG_file_type ]
!4 = metadata !{metadata !"empty.c", metadata !"/tmp"}
!5 = metadata !{i32 1, metadata !"Debug Info Version", i32 2}
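
The CHECK line expects the snowman to come back as its three UTF-8 bytes, each written as a backslash followed by two hex digits. The sketch below approximates that escaping so the expected string can be derived by hand. escapeBytes is a hypothetical helper, not the function llvm-dis uses, and it simplifies some cases (for example, it also hex-escapes backslashes).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: keeps printable ASCII as-is and writes every other
// byte (plus '"' and '\\') as a backslash followed by two uppercase hex digits.
std::string escapeBytes(const std::string &S) {
  static const char Hex[] = "0123456789ABCDEF";
  std::string Out;
  for (unsigned char C : S) {
    if (C >= 0x20 && C < 0x7F && C != '"' && C != '\\') {
      Out += static_cast<char>(C);
    } else {
      Out += '\\';
      Out += Hex[C >> 4];
      Out += Hex[C & 0xF];
    }
  }
  return Out;
}
```

Applied to "clang version " followed by the bytes E2 98 83, this produces the `clang version \E2\98\83` fragment seen in the CHECK line. A NUL byte becomes `\00`, which is how the field separators in the metadata string appear.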