This is an archive of the discontinued LLVM Phabricator instance.

Introduce StructuredData
Needs ReviewPublic

Authored by nhaehnle on May 11 2023, 7:36 AM.

Details

Summary

StructuredData is a fairly general mechanism for representing extensible
data in textual IR and in bitcode.

It is intended primarily as an abstraction layer used during IR printing,
parsing and (bitcode de-)serialization.

However, it can also be used to preserve structured data as-is for
extension points in tools where a particular extension isn't understood.

Possible use cases range over:

  • Encoding TargetTypeInfo -- this is the use case that initially triggered this development
  • Extensible human-readable and compile-time efficient metadata a la debug info metadata
  • Human-readable and compile-time efficient modifiers on extended instructions / intrinsics

This change already includes rudimentary printing and parsing support.
Bitcode reading and writing will follow in a subsequent change.

Diff Detail

Event Timeline

nhaehnle created this revision.May 11 2023, 7:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 11 2023, 7:36 AM
nhaehnle requested review of this revision.May 11 2023, 7:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 11 2023, 7:36 AM

Why does the symbol map need to be global, shouldn’t it by per LLVMContext (which would also get rid of the locking)?

Needs tests.

llvm/docs/LangRef.rst
3527–3531

Can constants follow the way metadata (and most other things in LLVM) is encoded and prefix the type?
I.e. i32 <integer> and i1 true/false

Thanks for taking a look!

Why does the symbol map need to be global, shouldn’t it by per LLVMContext (which would also get rid of the locking)?

It's global so that we can cheaply compare an sdata::Symbol against an sdata::RegisterSymbol. This is used e.g. here: https://github.com/nhaehnle/llvm-project/blob/6bb5923a059c1120e006cb0d0cd63c5ef0806c0e/llvm/lib/IR/Type.cpp#L844

Needs tests.

What kind of tests do you have in mind? This change doesn't really expose anything, and there are tests in the followup change in the stack.

llvm/docs/LangRef.rst
3527–3531

Good question. I've been going back and forth on this, and the current version is as-is mostly because the various !DIxyz metadata doesn't have those prefixes. But that may not be the best example to follow.

I'd be happy to change it to be more in line with metadata. Any other opinions?

What kind of tests do you have in mind?

I thought about tests for the parsing and printing code. I missed that the code is not yet used here, so please ignore my comment from before :)

Having global state makes things easy at the start (and efficient in the case here), but I fear we will run into problems later, for example when unloading and reloading libLLVM. There is a precedent for globals, command line arguments are global state and it is problematic.

The TargetTypeInfoDeserialize::registerSymbols() call is in the LLVMContext constructor, so it does not seem far-fetched to me to make symbols part of the context.

llvm/docs/LangRef.rst
3527–3531

But that may not be the best example to follow.

I once tried to write a parser for !DI metadata (tree-sitter, for syntax highlighting) and gave up because there were more and more edge-cases, so I’d be happy to see a more structured format ;)

nhaehnle marked an inline comment as done.Jun 20 2023, 4:52 AM
nhaehnle added inline comments.
llvm/docs/LangRef.rst
3527–3531

I'm going ahead and changing it to have the iN prefix. As a reasonable (I believe) consequence, the bit width is now actually part of the value, so if a user wants to distinguish between i7 0 and i9 0, they can.

nhaehnle updated this revision to Diff 532900.Jun 20 2023, 6:50 AM

Add iN prefix to integers (and bool) in structured data values

I am very concerned about making this a global structure, rather than something bound to the context.

More generally, I'm not happy that this new concept is being introduced as part of the target type implementation. This doesn't really seem helpful for this specific use case (it makes the implementation substantially more complex rather than simpler). It may well make sense as part of some larger context, but I think this larger context deserves a wider discussion (probably on discourse) to clarify what the goals of this abstraction are and make sure we have a good design for it.

More generally, I'm not happy that this new concept is being introduced as part of the target type implementation. This doesn't really seem helpful for this specific use case (it makes the implementation substantially more complex rather than simpler). It may well make sense as part of some larger context, but I think this larger context deserves a wider discussion (probably on discourse) to clarify what the goals of this abstraction are and make sure we have a good design for it.

Sure. It's a bit of a chicken-and-egg thing. I do think this approach is better than the piecemeal addition of things to bitcode reader/writer, which is rather subtle code that I don't think too many people understand, but at the same time, working on big changes is frustrating unless you have clear line-of-sight to actually getting something useful upstream. So I started with this small thing quite consciously on purpose.

I have a WIP change locally that leverages the same infrastructure for "extended metadata", and at least an early minimal version that does something interesting is something that I think I can put up reasonably soon so that the discussion doesn't end up overly abstract. I would also like to do the same thing for "extended instructions", but that's still a bit further out.

I am very concerned about making this a global structure, rather than something bound to the context.

You mean the symbol registration? That's ultimately a performance tradeoff. I wanted to avoid having too many string comparisons in potential hot paths, which requires interning, which requires some way to get at the intern'd ID that isn't too horrible. It would be good to understand what the actual concerns are with it; the intention is that the size of this structure is bounded by the number of static variables, so there isn't an urgent need (or even ability) to reclaim anything. But I would be happier about it if I had a way to enforce that.

nikic added a comment.Jun 20 2023, 2:03 PM

More generally, I'm not happy that this new concept is being introduced as part of the target type implementation. This doesn't really seem helpful for this specific use case (it makes the implementation substantially more complex rather than simpler). It may well make sense as part of some larger context, but I think this larger context deserves a wider discussion (probably on discourse) to clarify what the goals of this abstraction are and make sure we have a good design for it.

Sure. It's a bit of a chicken-and-egg thing. I do think this approach is better than the piecemeal addition of things to bitcode reader/writer, which is rather subtle code that I don't think too many people understand, but at the same time, working on big changes is frustrating unless you have clear line-of-sight to actually getting something useful upstream. So I started with this small thing quite consciously on purpose.

I have a WIP change locally that leverages the same infrastructure for "extended metadata", and at least an early minimal version that does something interesting is something that I think I can put up reasonably soon so that the discussion doesn't end up overly abstract. I would also like to do the same thing for "extended instructions", but that's still a bit further out.

FWIW, I do think this is pretty reasonable when seen as a pure IR/bitcode abstraction. I wouldn't have concerns if this were (initially at least) represented as std::pair<StringRef, sdata::Value> and only used as part of reading/writing. That use case doesn't really need any of the Symbol machinery -- we get these as strings in IR/bitcode anyway, and I don't think there is a substantial performance difference between interning the strings and then comparing IDs compared to directly comparing to a small handful of strings, during parsing only.

The design you have makes a lot more sense if the sdata is supposed to be used as part of the in-memory IR as well, in which case the Symbols are important for more efficient access -- but I'm not sure if they are ideal in that context either. That would still require that we have Symbols and Values (std::variant, ugh) in the representation. Ideally we'd just have a C++ struct with properly typed members, and sdata only being used for the purposes of serializing/deserializing those structs using a general mechanism, without requiring new asm/bitcode support each time.

llvm/include/llvm/IR/StructuredData.h
26

Unused

177

Not used in this patch -- move to the last one?

nhaehnle updated this revision to Diff 533139.Jun 20 2023, 11:12 PM

Drop symbols-as-values for now

nhaehnle marked 2 inline comments as done.EditedJun 20 2023, 11:29 PM

More generally, I'm not happy that this new concept is being introduced as part of the target type implementation. This doesn't really seem helpful for this specific use case (it makes the implementation substantially more complex rather than simpler). It may well make sense as part of some larger context, but I think this larger context deserves a wider discussion (probably on discourse) to clarify what the goals of this abstraction are and make sure we have a good design for it.

Sure. It's a bit of a chicken-and-egg thing. I do think this approach is better than the piecemeal addition of things to bitcode reader/writer, which is rather subtle code that I don't think too many people understand, but at the same time, working on big changes is frustrating unless you have clear line-of-sight to actually getting something useful upstream. So I started with this small thing quite consciously on purpose.

I have a WIP change locally that leverages the same infrastructure for "extended metadata", and at least an early minimal version that does something interesting is something that I think I can put up reasonably soon so that the discussion doesn't end up overly abstract. I would also like to do the same thing for "extended instructions", but that's still a bit further out.

FWIW, I do think this is pretty reasonable when seen as a pure IR/bitcode abstraction. I wouldn't have concerns if this were (initially at least) represented as std::pair<StringRef, sdata::Value> and only used as part of reading/writing. That use case doesn't really need any of the Symbol machinery -- we get these as strings in IR/bitcode anyway, and I don't think there is a substantial performance difference between interning the strings and then comparing IDs compared to directly comparing to a small handful of strings, during parsing only.

The design you have makes a lot more sense if the sdata is supposed to be used as part of the in-memory IR as well, in which case the Symbols are important for more efficient access -- but I'm not sure if they are ideal in that context either. That would still require that we have Symbols and Values (std::variant, ugh) in the representation. Ideally we'd just have a C++ struct with properly typed members, and sdata only being used for the purposes of serializing/deserializing those structs using a general mechanism, without requiring new asm/bitcode support each time.

Hmm, perhaps that has been a case of premature optimization.

Indeed my intention is that "known" extended structures (in the general sense) are represented as C++ structs with properly typed members in live IR. The only exception are "generic" extended structures which are used as a fallback in-memory representation to preserve as a black-box any extension structures that are unknown. That is, in our graphics compiler my current thinking is for us to have extended metadata objects along the lines of:

!lgc.rasterizer.state { discardEnable: i1 true, perSampleShading: i1 false, ... }

In our compiler, these would be represented as an instance of an lgc::RasterizerStateMetadata class which is derived from an llvm::ExtMetadata base class and contains these fields as plain data. But when we write out intermediate IR with something like -stop-after and then run it through generic tools like opt, llvm-dis, etc., the same metadata object is represented by a llvm::GenericExtMetadata which just holds an array of std::pair<sdata::Symbol, sdata::Value>. But perhaps that could just be a std::pair<std::string, sdata::Value> instead because nobody really accesses these fields anyway.

I'll think about this some more but removing the Symbol stuff and just replacing by StringRef is starting to sound reasonable to me.

llvm/include/llvm/IR/StructuredData.h
26

(only MDNode is truly unused, removing that)

177

Sure

nhaehnle updated this revision to Diff 533147.Jun 20 2023, 11:42 PM
nhaehnle marked 2 inline comments as done.

Move SchemaField out of this patch

nhaehnle updated this revision to Diff 533499.Jun 22 2023, 1:25 AM

Main change is removal of sdata::Symbol in favor of plain StringRef

I think this larger context deserves a wider discussion

I started a thread on Discourse: https://discourse.llvm.org/t/rfc-structured-data-for-extensibility-in-llvm-ir/71527

arsenm added a subscriber: arsenm.Jun 22 2023, 2:59 AM

Missing all the assembler tests (but I guess they come with the bitcode parts?)

llvm/include/llvm/IR/StructuredData.h
71

Does std::get on variant not do this for you?

llvm/lib/IR/AsmWriter.cpp
1345

seems more like a report_fatal_error kind of case

Missing all the assembler tests (but I guess they come with the bitcode parts?)

Yeah, the assembler tests are with the actual use cases of this infrastructure in the next patch.

llvm/include/llvm/IR/StructuredData.h
71

std::get throws on error and we disable exceptions in LLVM. I'm not sure what that means in practice so thought it safest to add the assertion.

llvm/lib/IR/AsmWriter.cpp
1345

Really? I thought report_fatal_error was primarily for states that can be reached by bad input. This state here cannot be reached by bad input, but only by somebody extending sdata::Value and then forgetting to change this code.

nikic added a comment.Jun 27 2023, 3:49 AM

Thanks for providing the context for this functionality. The updated patch looks good to me. I think even if we never end up introducing more uses for this, this is not going to be an undue burden, so I think it's okay to move forward with the target type use case.

llvm/include/llvm/IR/StructuredData.h
19

DenseMap is not used.

37

Why do we need this class to be default-constructible? (I presume this is also why there is a monostate in the storage?)

llvm/lib/AsmParser/LLParser.cpp
4213

Okay, I guess this is where the default initialization is used. Possibly this should be std::optional instead -- or parse the sdata value in a separate function and return it?

llvm/lib/IR/StructuredData.cpp
24

Structured

khei4 added a subscriber: khei4.Jun 27 2023, 8:50 PM