This is an archive of the discontinued LLVM Phabricator instance.

[mlir:Bytecode][NFC] Refactor string section writing and reading
Closed · Public

Authored by rriddle on Aug 23 2022, 1:34 PM.

Details

Reviewers

mehdi_amini
jpienaar

Commits

rG83dc9999486f: [mlir:Bytecode][NFC] Refactor string section writing and reading

Summary

This extracts the string section writer and reader into dedicated
classes, which better separates the logic and will also simplify future
patches that want to interact with the string section.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rriddle created this revision. Aug 23 2022, 1:34 PM

Herald added a project: Restricted Project. Aug 23 2022, 1:34 PM

Herald added subscribers: bzcheeseman, sdasgup3, wenzhicui and 18 others.

rriddle requested review of this revision. Aug 23 2022, 1:34 PM

Herald added a project: Restricted Project. Aug 23 2022, 1:34 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

rriddle added a child revision: D132497: [mlir:Bytecode][NFC] Cleanup Attribute/Type reading. Aug 23 2022, 1:34 PM

rriddle added reviewers: mehdi_amini, jpienaar. Aug 23 2022, 1:35 PM

mehdi_amini added inline comments. Aug 23 2022, 1:57 PM

mlir/lib/Bytecode/Reader/BytecodeReader.cpp
295	Shall we check on load that there is no null char in the middle of the string?
302	Do you need totalStringDataSize? Isn't it the same as checking whether stringDataEndOffset == 0?
mlir/lib/Bytecode/Writer/BytecodeWriter.cpp
232	Still an opportunity to sort the strings based on frequency, right? :)

mehdi_amini accepted this revision. Aug 23 2022, 1:57 PM

This revision is now accepted and ready to land. Aug 23 2022, 1:57 PM

rriddle marked 2 inline comments as done. Aug 23 2022, 3:27 PM

rriddle added inline comments.

mlir/lib/Bytecode/Reader/BytecodeReader.cpp
295	StringAttr supports strings with null characters, so I've just been allowing strings that contain null characters in the string section (to simplify things). It doesn't affect loading at all, given that we don't use the null character for size computation. Do you think we should disallow that here?
mlir/lib/Bytecode/Writer/BytecodeWriter.cpp
232	Yeah. I've got a TODO for that in the follow-up Attr/Type patch as well; it should just require a bit of preprocessing (with no format changes). It wasn't clear in the beginning how annoying it would be to support, but after attr/type support I don't think it'll be too bad.
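
The point about null characters follows from the section layout: each string's length comes from the size table, not from scanning for a terminator. Below is a minimal standard-C++ sketch of that reading scheme, not the code from the patch. It assumes the count and sizes are single bytes (the patch parses varints through EncodingReader), `readStringSection` is a hypothetical name, and the `size == 0` guard is an addition of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical simplification: the count and each size occupy one byte here,
// whereas the real reader parses them as varints.
// The returned views point into `section`, which must outlive them.
std::optional<std::vector<std::string_view>>
readStringSection(const std::vector<uint8_t> &section) {
  if (section.empty())
    return std::nullopt;
  std::size_t cursor = 0;
  std::size_t numStrings = section[cursor++];
  std::vector<std::string_view> strings(numStrings);

  // Sizes are stored in reverse order, so fill the table back to front while
  // walking the string data from the end of the section.
  std::size_t dataEnd = section.size();
  for (std::size_t i = numStrings; i-- > 0;) {
    if (cursor >= section.size())
      return std::nullopt;
    std::size_t size = section[cursor++];
    // Each size includes the trailing null, so it must be at least 1
    // (this guard is an assumption of the sketch, not part of the patch).
    if (size == 0 || dataEnd < size)
      return std::nullopt;
    std::size_t offset = dataEnd - size;
    // Drop the trailing null; embedded nulls are kept as ordinary bytes.
    strings[i] = std::string_view(
        reinterpret_cast<const char *>(section.data() + offset), size - 1);
    dataEnd = offset;
  }

  // The size table must end exactly where the string data begins.
  if (cursor != dataEnd)
    return std::nullopt;
  return strings;
}
```

For the section bytes `{2, 4, 3, 'a', 'b', 0, 'c', 0, 'd', 0}` this yields `"ab"` and the three-byte string `c\0d`. The final comparison plays the role of the trailing-data check in the patch: the first string starts right after the size table, so its offset is compared against the number of bytes consumed rather than against 0.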

Harbormaster completed remote builds in B182915: Diff 454939. Aug 23 2022, 4:00 PM

rriddle updated this revision to Diff 455009. Aug 23 2022, 4:30 PM

rriddle marked an inline comment as done.

This revision was landed with ongoing or failed builds. Aug 23 2022, 4:56 PM

Closed by commit rG83dc9999486f: [mlir:Bytecode][NFC] Refactor string section writing and reading (authored by rriddle).

This revision was automatically updated to reflect the committed changes.

rriddle added a commit: rG83dc9999486f: [mlir:Bytecode][NFC] Refactor string section writing and reading.

Harbormaster completed remote builds in B182971: Diff 455009. Aug 23 2022, 7:21 PM

mehdi_amini added inline comments. Aug 24 2022, 1:45 AM

mlir/lib/Bytecode/Reader/BytecodeReader.cpp
295	Not necessarily, but I remember some doc on the bytecode that was explicit about null-terminated strings, which excluded null chars?
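
The thread does not settle whether embedded nulls should be rejected. If the format were to disallow them, a load-time check could be as small as the sketch below. `hasEmbeddedNull` is a hypothetical helper written against the standard library, not something taken from the patch.

```cpp
#include <string_view>

// Returns true if `str` contains a null character anywhere in its body.
// The view is expected to exclude the trailing terminator already.
bool hasEmbeddedNull(std::string_view str) {
  return str.find('\0') != std::string_view::npos;
}
```

A reader that wanted to enforce the stricter rule would call this on each extracted string and emit an error on `true`. Leaving the check out keeps the section able to hold any StringAttr value.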

Revision Contents

Path                                           Size
mlir/lib/Bytecode/Reader/BytecodeReader.cpp    128 lines
mlir/lib/Bytecode/Writer/BytecodeWriter.cpp    66 lines

Diff 455017

mlir/lib/Bytecode/Reader/BytecodeReader.cpp

...
static LogicalResult parseEntry(EncodingReader &reader, RangeT &entries,
T &entry, StringRef entryStr) {		T &entry, StringRef entryStr) {
uint64_t entryIdx;		uint64_t entryIdx;
if (failed(reader.parseVarInt(entryIdx)))		if (failed(reader.parseVarInt(entryIdx)))
return failure();		return failure();
return resolveEntry(reader, entries, entryIdx, entry, entryStr);		return resolveEntry(reader, entries, entryIdx, entry, entryStr);
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// StringSectionReader
		//===----------------------------------------------------------------------===//

		namespace {
		/// This class is used to read references to the string section from the
		/// bytecode.
		class StringSectionReader {
		public:
		/// Initialize the string section reader with the given section data.
		LogicalResult initialize(Location fileLoc, ArrayRef<uint8_t> sectionData);

		/// Parse a shared string from the string section. The shared string is
		/// encoded using an index to a corresponding string in the string section.
		LogicalResult parseString(EncodingReader &reader, StringRef &result) {
		return parseEntry(reader, strings, result, "string");
		}

		private:
		/// The table of strings referenced within the bytecode file.
		SmallVector<StringRef> strings;
		};
		} // namespace

		LogicalResult StringSectionReader::initialize(Location fileLoc,
		ArrayRef<uint8_t> sectionData) {
		EncodingReader stringReader(sectionData, fileLoc);

		// Parse the number of strings in the section.
		uint64_t numStrings;
		if (failed(stringReader.parseVarInt(numStrings)))
		return failure();
		strings.resize(numStrings);

		// Parse each of the strings. The sizes of the strings are encoded in reverse
		// order, so that's the order we populate the table.
		size_t stringDataEndOffset = sectionData.size();
		for (StringRef &string : llvm::reverse(strings)) {
		uint64_t stringSize;
		if (failed(stringReader.parseVarInt(stringSize)))
		return failure();
		if (stringDataEndOffset < stringSize) {
		return stringReader.emitError(
		"string size exceeds the available data size");
		}

		// Extract the string from the data, dropping the null character.
		size_t stringOffset = stringDataEndOffset - stringSize;
		string = StringRef(
		reinterpret_cast<const char *>(sectionData.data() + stringOffset),
		stringSize - 1);
		stringDataEndOffset = stringOffset;
		}

		// Check that the only remaining data was for the strings, i.e. the reader
		// should be at the same offset as the first string.
		if ((sectionData.size() - stringReader.size()) != stringDataEndOffset) {
		return stringReader.emitError("unexpected trailing data between the "
		"offsets for strings and their data");
		}
		return success();
		}

		//===----------------------------------------------------------------------===//
// BytecodeDialect		// BytecodeDialect
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
/// This struct represents a dialect entry within the bytecode.		/// This struct represents a dialect entry within the bytecode.
struct BytecodeDialect {		struct BytecodeDialect {
/// Load the dialect into the provided context if it hasn't been loaded yet.		/// Load the dialect into the provided context if it hasn't been loaded yet.
/// Returns failure if the dialect couldn't be loaded and the provided		/// Returns failure if the dialect couldn't be loaded and the provided
...
FailureOr<Operation *> parseOpWithoutRegions(EncodingReader &reader,
RegionReadState &readState,		RegionReadState &readState,
bool &isIsolatedFromAbove);		bool &isIsolatedFromAbove);

LogicalResult parseRegion(EncodingReader &reader, RegionReadState &readState);		LogicalResult parseRegion(EncodingReader &reader, RegionReadState &readState);
LogicalResult parseBlock(EncodingReader &reader, RegionReadState &readState);		LogicalResult parseBlock(EncodingReader &reader, RegionReadState &readState);
LogicalResult parseBlockArguments(EncodingReader &reader, Block *block);		LogicalResult parseBlockArguments(EncodingReader &reader, Block *block);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// String Section

LogicalResult parseStringSection(ArrayRef<uint8_t> sectionData);

/// Parse a shared string from the string section. The shared string is
/// encoded using an index to a corresponding string in the string section.
LogicalResult parseSharedString(EncodingReader &reader, StringRef &result) {
return parseEntry(reader, strings, result, "string");
}

//===--------------------------------------------------------------------===//
// Value Processing		// Value Processing

/// Parse an operand reference using the given reader. Returns nullptr in the		/// Parse an operand reference using the given reader. Returns nullptr in the
/// case of failure.		/// case of failure.
Value parseOperand(EncodingReader &reader);		Value parseOperand(EncodingReader &reader);

/// Sequentially define the given value range.		/// Sequentially define the given value range.
LogicalResult defineValues(EncodingReader &reader, ValueRange values);		LogicalResult defineValues(EncodingReader &reader, ValueRange values);
...
private:
/// The producer of the bytecode being read.		/// The producer of the bytecode being read.
StringRef producer;		StringRef producer;

/// The table of IR units referenced within the bytecode file.		/// The table of IR units referenced within the bytecode file.
SmallVector<BytecodeDialect> dialects;		SmallVector<BytecodeDialect> dialects;
SmallVector<BytecodeOperationName> opNames;		SmallVector<BytecodeOperationName> opNames;

/// The table of strings referenced within the bytecode file.		/// The table of strings referenced within the bytecode file.
SmallVector<StringRef> strings;		StringSectionReader stringReader;

/// The current set of available IR value scopes.		/// The current set of available IR value scopes.
std::vector<ValueScope> valueScopes;		std::vector<ValueScope> valueScopes;
/// A block containing the set of operations defined to create forward		/// A block containing the set of operations defined to create forward
/// references.		/// references.
Block forwardRefOps;		Block forwardRefOps;
/// A block containing previously created, and no longer used, forward		/// A block containing previously created, and no longer used, forward
/// reference operations.		/// reference operations.
...
LogicalResult BytecodeReader::read(llvm::MemoryBufferRef buffer, Block *block) {
for (int i = 0; i < bytecode::Section::kNumSections; ++i) {		for (int i = 0; i < bytecode::Section::kNumSections; ++i) {
if (!sectionDatas[i]) {		if (!sectionDatas[i]) {
return reader.emitError("missing data for top-level section: ",		return reader.emitError("missing data for top-level section: ",
toString(bytecode::Section::ID(i)));		toString(bytecode::Section::ID(i)));
}		}
}		}

// Process the string section first.		// Process the string section first.
if (failed(parseStringSection(*sectionDatas[bytecode::Section::kString])))		if (failed(stringReader.initialize(
		fileLoc, *sectionDatas[bytecode::Section::kString])))
return failure();		return failure();

// Process the dialect section.		// Process the dialect section.
if (failed(parseDialectSection(*sectionDatas[bytecode::Section::kDialect])))		if (failed(parseDialectSection(*sectionDatas[bytecode::Section::kDialect])))
return failure();		return failure();

// Process the attribute and type section.		// Process the attribute and type section.
if (failed(attrTypeReader.initialize(		if (failed(attrTypeReader.initialize(
...
BytecodeReader::parseDialectSection(ArrayRef<uint8_t> sectionData) {
// Parse the number of dialects in the section.		// Parse the number of dialects in the section.
uint64_t numDialects;		uint64_t numDialects;
if (failed(sectionReader.parseVarInt(numDialects)))		if (failed(sectionReader.parseVarInt(numDialects)))
return failure();		return failure();
dialects.resize(numDialects);		dialects.resize(numDialects);

// Parse each of the dialects.		// Parse each of the dialects.
for (uint64_t i = 0; i < numDialects; ++i)		for (uint64_t i = 0; i < numDialects; ++i)
if (failed(parseSharedString(sectionReader, dialects[i].name)))		if (failed(stringReader.parseString(sectionReader, dialects[i].name)))
return failure();		return failure();

// Parse the operation names, which are grouped by dialect.		// Parse the operation names, which are grouped by dialect.
auto parseOpName = [&](BytecodeDialect *dialect) {		auto parseOpName = [&](BytecodeDialect *dialect) {
StringRef opName;		StringRef opName;
if (failed(parseSharedString(sectionReader, opName)))		if (failed(stringReader.parseString(sectionReader, opName)))
return failure();		return failure();
opNames.emplace_back(dialect, opName);		opNames.emplace_back(dialect, opName);
return success();		return success();
};		};
while (!sectionReader.empty())		while (!sectionReader.empty())
if (failed(parseDialectGrouping(sectionReader, dialects, parseOpName)))		if (failed(parseDialectGrouping(sectionReader, dialects, parseOpName)))
return failure();		return failure();
return success();		return success();
...
while (numArgs--) {
argTypes.push_back(argType);		argTypes.push_back(argType);
argLocs.push_back(argLoc);		argLocs.push_back(argLoc);
}		}
block->addArguments(argTypes, argLocs);		block->addArguments(argTypes, argLocs);
return defineValues(reader, block->getArguments());		return defineValues(reader, block->getArguments());
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// String Section

LogicalResult
BytecodeReader::parseStringSection(ArrayRef<uint8_t> sectionData) {
EncodingReader stringReader(sectionData, fileLoc);

// Parse the number of strings in the section.
uint64_t numStrings;
if (failed(stringReader.parseVarInt(numStrings)))
return failure();
strings.resize(numStrings);

// Parse each of the strings. The sizes of the strings are encoded in reverse
// order, so that's the order we populate the table.
size_t stringDataEndOffset = sectionData.size();
size_t totalStringDataSize = 0;
for (StringRef &string : llvm::reverse(strings)) {
uint64_t stringSize;
if (failed(stringReader.parseVarInt(stringSize)))
return failure();
if (stringDataEndOffset < stringSize) {
return stringReader.emitError(
"string size exceeds the available data size");
}

// Extract the string from the data, dropping the null character.
size_t stringOffset = stringDataEndOffset - stringSize;
string = StringRef(
reinterpret_cast<const char *>(sectionData.data() + stringOffset),
stringSize - 1);
stringDataEndOffset = stringOffset;

// Update the total string data size.
totalStringDataSize += stringSize;
}

// Check that the only remaining data was for the strings
if (stringReader.size() != totalStringDataSize) {
return stringReader.emitError("unexpected trailing data between the "
"offsets for strings and their data");
}
return success();
}

//===----------------------------------------------------------------------===//
// Value Processing		// Value Processing

Value BytecodeReader::parseOperand(EncodingReader &reader) {		Value BytecodeReader::parseOperand(EncodingReader &reader) {
std::vector<Value> &values = valueScopes.back().values;		std::vector<Value> &values = valueScopes.back().values;
Value *value = nullptr;		Value *value = nullptr;
if (failed(parseEntry(reader, values, value, "value")))		if (failed(parseEntry(reader, values, value, "value")))
return Value();		return Value();

...

mlir/lib/Bytecode/Writer/BytecodeWriter.cpp

...
void EncodingEmitter::emitMultiByteVarInt(uint64_t value) {

// If the value is too large to encode in a single byte, emit a special all		// If the value is too large to encode in a single byte, emit a special all
// zero marker byte and splat the value directly.		// zero marker byte and splat the value directly.
emitByte(0);		emitByte(0);
emitBytes({reinterpret_cast<uint8_t *>(&value), sizeof(value)});		emitBytes({reinterpret_cast<uint8_t *>(&value), sizeof(value)});
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// StringSectionBuilder
		//===----------------------------------------------------------------------===//

		namespace {
		/// This class is used to simplify the process of emitting the string section.
		class StringSectionBuilder {
		public:
		/// Add the given string to the string section, and return the index of the
		/// string within the section.
		size_t insert(StringRef str) {
		auto it = strings.insert({llvm::CachedHashStringRef(str), strings.size()});
		return it.first->second;
		}

		/// Write the current set of strings to the given emitter.
		void write(EncodingEmitter &emitter) {
		emitter.emitVarInt(strings.size());

		// Emit the sizes in reverse order, so that we don't need to backpatch an
		// offset to the string data or have a separate section.
		for (const auto &it : llvm::reverse(strings))
		emitter.emitVarInt(it.first.size() + 1);
		// Emit the string data itself.
		for (const auto &it : strings)
		emitter.emitNulTerminatedString(it.first.val());
		}

		private:
		/// A set of strings referenced within the bytecode. The value of the map is
		/// unused.
		llvm::MapVector<llvm::CachedHashStringRef, size_t> strings;
		};
		} // namespace

		//===----------------------------------------------------------------------===//
// Bytecode Writer		// Bytecode Writer
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
class BytecodeWriter {		class BytecodeWriter {
public:		public:
BytecodeWriter(Operation *op) : numberingState(op) {}		BytecodeWriter(Operation *op) : numberingState(op) {}

...
private:
void writeRegion(EncodingEmitter &emitter, Region *region);		void writeRegion(EncodingEmitter &emitter, Region *region);
void writeIRSection(EncodingEmitter &emitter, Operation *op);		void writeIRSection(EncodingEmitter &emitter, Operation *op);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Strings		// Strings

void writeStringSection(EncodingEmitter &emitter);		void writeStringSection(EncodingEmitter &emitter);

/// Get the number for the given shared string, that is contained within the
/// string section.
size_t getSharedStringNumber(StringRef str);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Fields		// Fields

		/// The builder used for the string section.
		StringSectionBuilder stringSection;

/// The IR numbering state generated for the root operation.		/// The IR numbering state generated for the root operation.
IRNumberingState numberingState;		IRNumberingState numberingState;

/// A set of strings referenced within the bytecode. The value of the map is
/// unused.
llvm::MapVector<llvm::CachedHashStringRef, size_t> strings;
};		};
} // namespace		} // namespace

void BytecodeWriter::write(Operation *rootOp, raw_ostream &os,		void BytecodeWriter::write(Operation *rootOp, raw_ostream &os,
StringRef producer) {		StringRef producer) {
EncodingEmitter emitter;		EncodingEmitter emitter;

// Emit the bytecode file header. This is how we identify the output as a		// Emit the bytecode file header. This is how we identify the output as a
...

void BytecodeWriter::writeDialectSection(EncodingEmitter &emitter) {		void BytecodeWriter::writeDialectSection(EncodingEmitter &emitter) {
EncodingEmitter dialectEmitter;		EncodingEmitter dialectEmitter;

// Emit the referenced dialects.		// Emit the referenced dialects.
auto dialects = numberingState.getDialects();		auto dialects = numberingState.getDialects();
dialectEmitter.emitVarInt(llvm::size(dialects));		dialectEmitter.emitVarInt(llvm::size(dialects));
for (DialectNumbering &dialect : dialects)		for (DialectNumbering &dialect : dialects)
dialectEmitter.emitVarInt(getSharedStringNumber(dialect.name));		dialectEmitter.emitVarInt(stringSection.insert(dialect.name));

// Emit the referenced operation names grouped by dialect.		// Emit the referenced operation names grouped by dialect.
auto emitOpName = [&](OpNameNumbering &name) {		auto emitOpName = [&](OpNameNumbering &name) {
dialectEmitter.emitVarInt(getSharedStringNumber(name.name.stripDialect()));		dialectEmitter.emitVarInt(stringSection.insert(name.name.stripDialect()));
};		};
writeDialectGrouping(dialectEmitter, numberingState.getOpNames(), emitOpName);		writeDialectGrouping(dialectEmitter, numberingState.getOpNames(), emitOpName);

emitter.emitSection(bytecode::Section::kDialect, std::move(dialectEmitter));		emitter.emitSection(bytecode::Section::kDialect, std::move(dialectEmitter));
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Attributes and Types		// Attributes and Types
...
void BytecodeWriter::writeIRSection(EncodingEmitter &emitter, Operation *op) {
emitter.emitSection(bytecode::Section::kIR, std::move(irEmitter));		emitter.emitSection(bytecode::Section::kIR, std::move(irEmitter));
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Strings		// Strings

void BytecodeWriter::writeStringSection(EncodingEmitter &emitter) {		void BytecodeWriter::writeStringSection(EncodingEmitter &emitter) {
EncodingEmitter stringEmitter;		EncodingEmitter stringEmitter;
stringEmitter.emitVarInt(strings.size());		stringSection.write(stringEmitter);

// Emit the sizes in reverse order, so that we don't need to backpatch an
// offset to the string data or have a separate section.
for (const auto &it : llvm::reverse(strings))
stringEmitter.emitVarInt(it.first.size() + 1);
// Emit the string data itself.
for (const auto &it : strings)
stringEmitter.emitNulTerminatedString(it.first.val());

emitter.emitSection(bytecode::Section::kString, std::move(stringEmitter));		emitter.emitSection(bytecode::Section::kString, std::move(stringEmitter));
}		}

size_t BytecodeWriter::getSharedStringNumber(StringRef str) {
auto it = strings.insert({llvm::CachedHashStringRef(str), strings.size()});
return it.first->second;
}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Entry Points		// Entry Points
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

void mlir::writeBytecodeToFile(Operation *op, raw_ostream &os,		void mlir::writeBytecodeToFile(Operation *op, raw_ostream &os,
StringRef producer) {		StringRef producer) {
BytecodeWriter writer(op);		BytecodeWriter writer(op);
writer.write(op, os, producer);		writer.write(op, os, producer);
}		}
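
For readers without an LLVM checkout, the writer half of the refactoring can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not the patch's StringSectionBuilder. `StringTableSketch` is a hypothetical name, a `std::vector` plus `std::unordered_map` stands in for `llvm::MapVector`, and the count and sizes are written as single bytes where the patch emits varints.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for StringSectionBuilder using only the standard
// library; insertion order is kept in a vector, lookups go through a map.
class StringTableSketch {
public:
  // Returns the index of `str`, inserting it on first use.
  std::size_t insert(const std::string &str) {
    auto [it, inserted] = indices.try_emplace(str, strings.size());
    if (inserted)
      strings.push_back(str);
    return it->second;
  }

  // Serializes the table: count, sizes in reverse order, then string data.
  // Assumption: the count and every size fit in one byte.
  std::vector<uint8_t> write() const {
    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(strings.size()));
    // Sizes include the trailing null and are emitted last-to-first, so a
    // reader can walk the string data backwards from the end of the section.
    for (auto it = strings.rbegin(); it != strings.rend(); ++it)
      out.push_back(static_cast<uint8_t>(it->size() + 1));
    // String data follows in insertion order, each entry null-terminated.
    for (const std::string &str : strings) {
      out.insert(out.end(), str.begin(), str.end());
      out.push_back(0);
    }
    return out;
  }

private:
  std::vector<std::string> strings;
  std::unordered_map<std::string, std::size_t> indices;
};
```

Inserting `"builtin"`, `"func"`, and `"builtin"` again returns indices 0, 1, and 0, which matches how `stringSection.insert(...)` is used in `writeDialectSection`: the returned index is what gets emitted in place of the string.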