This is an archive of the discontinued LLVM Phabricator instance.

[TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output
Needs RevisionPublic

Authored by Rot127 on Nov 18 2022, 12:37 PM.

Details

Summary

This patch introduces an abstraction layer for TableGen backends. It separates the code generation logic and the syntax string output into separated modules.
One module is the backend (RegisterInfo in this patch), the other PrinterLLVM.

The code generation stays in the TableGen backend. The syntax generation however is moved to PrinterLLVM, which writes the result to the output stream.
Whenever the backend has generated a piece of code it requests PrinterLLVM to write the syntax to the output stream.

Before this patch the TableGen backend generated the syntax on its own.
This had the disadvantage that no other languages, then the default C++ syntax, could be output easily.
With the separation of the code generation (backend) and syntax generation (PrinterLLVM) it is easier to output syntax of other languages.
It is also possible write a printer which emits only specific parts of the generated code.

A new printer only implements the methods of the PrinterLLVM it needs and emits its language syntax.

In this patch the RegisterInfo backend was refactored to use PrinterLLVM as output module.
If this gets upstreamed I can provide the refactor for the following modules in the next days:

  • DecoderEmitter
  • SubtargetInfo
  • InstructionInfo
  • AsmMatcher

For more background info and where this idea is coming from you can take a look at:

_Feedback implementation_
https://reviews.llvm.org/D136808

_Discussion in discourse.llvm.org_
https://discourse.llvm.org/t/comments-needed-for-refactoring-decoderemitter-tablegen-backend/65738

Diff Detail

Event Timeline

Rot127 created this revision.Nov 18 2022, 12:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 18 2022, 12:37 PM
Rot127 requested review of this revision.Nov 18 2022, 12:37 PM
Rot127 added a comment.EditedNov 18 2022, 1:48 PM

To test the output after the refactor I used the following simple scripts:

Generate code files of all architectures.
This needs to be run twice. Once with a build on the main branch (remove the --printerLang=C++ option and the -new in the output filename).

And once with a build with this patch (add the -new in the output filename and --printerLang=C++ option` again)

#!/bin/bash
LLVM_PATH="<path>"
for ARCH in AArch64 AMDGPU ARC ARM AVR BPF CSKY DirectX Hexagon Lanai LoongArch M68k Mips MSP430 NVPTX PPC RISCV Sparc SPIRV SystemZ VE XCore; do
  echo $ARCH
  if [ $ARCH = "PPC" ]; then
    ARCH_D="PowerPC"
  else
    ARCH_D=$ARCH
  fi
  $LLVM_PATH/build/bin/llvm-tblgen --gen-register-info --printerLang=C++ -I $LLVM_PATH/llvm/include/ -I $LLVM_PATH/llvm/lib/Target/$ARCH_D/ -o $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH-new.inc $LLVM_PATH/llvm/lib/Target/$ARCH_D/$ARCH.td
done

Then diffing <Arch>.inc and <Arch>-new.inc with

#!/bin/bash
LLVM_PATH="<path>"
for ARCH in AArch64 AMDGPU ARC ARM AVR BPF CSKY DirectX Hexagon Lanai LoongArch M68k Mips MSP430 NVPTX PPC RISCV Sparc SPIRV SystemZ VE XCore; do
  echo $ARCH
  diff $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH.inc $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH-new.inc
  echo $ARCH DONE
done
Rot127 retitled this revision from {TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output to [TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output.Nov 22 2022, 8:16 AM
Rot127 added reviewers: resistor, grosbach, craig.topper.

I like your idea, but I'm not a big fan of the code structure. It seems that we have to add a plethora of virtual functions for every backend in a single PrinterInterface class ? Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Rot127 added a comment.EditedNov 24 2022, 7:02 AM

Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Indeed it does. How about removing the PrinterInterface and make the PrinterLLVM the base class for all extensions (and mark its methods as virtual). Other Printers inherit from PrinterLLVM and override the methods needed?
This wouldn't be a big change, but also keep the printing logic in one place and not spread around the backends.

Rot127 updated this revision to Diff 485707.Dec 30 2022, 9:48 AM
Rot127 edited the summary of this revision. (Show Details)

Removes the PrinterInterface. New language printer should simply inherit from PrinterLLVM.

arsenm added a subscriber: arsenm.Jan 12 2023, 9:13 AM
arsenm added inline comments.
llvm/utils/TableGen/Printer.h
74

Single quotes

100

Should use StringRef for all these const string parameters

llvm/utils/TableGen/PrinterLLVM.cpp
1419

Don't need a separate OS << on each line, can just break the one literal across multiple lines

rovka added a subscriber: rovka.Jan 13 2023, 2:07 AM

Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Indeed it does. How about removing the PrinterInterface and make the PrinterLLVM the base class for all extensions (and mark its methods as virtual). Other Printers inherit from PrinterLLVM and override the methods needed?
This wouldn't be a big change, but also keep the printing logic in one place and not spread around the backends.

I too have concerns about the design here. I'd rather see smaller, more specific classes, e.g. a RegEmitterPrinter abstract base class which can then be inherited by a RegEmitterCPPPrinter(*) and other similar RegEmitter<Lang>Printer. Having the printer for a non-C++ language inherit from the C++ printer seems suspicious - what happens if you only override a subset of the virtuals? Do you get random bursts of C++ code in your output? Using an abstract base class seems like a better choice since then if you forget to override a method (of if one is added to the base class and the subclasses aren't updated) then things should explode in an easy-to-understand/fix way.

Speaking of random bursts of C++ code, how do you handle TableGen backends that allow C++ snippets to be embedded (e.g. for the instruction selection, assuming the docs are up to date)?

Would be nice to get a few more opinions about this before reworking the whole patch :)

(*) We could also have a CommonCPPPrinter (maybe with a better name) that contains things that are the same for all the backends, e.g. emitNamespace and other things like that (although I guess that can be in a later patch).

llvm/utils/TableGen/Printer.h
82

Nit: Would be clearer to just have emitBeginNamespace and emitEndNamespace. Ditto below for emitIncludeToggle.

llvm/utils/TableGen/RegisterInfoEmitter.cpp
787

Nit: This looks unrelated, you should commit it separately.

llvm/utils/TableGen/SequenceToOffsetTable.h
115

I'd rather extract the printing code for this into a printer too.

Rot127 added a comment.EditedJan 14 2023, 2:15 PM

Thanks for the feedback so far!

I'd rather see smaller, more specific classes, e.g. a RegEmitterPrinter abstract base class which can then be inherited by a RegEmitterCPPPrinter(*) and other similar RegEmitter<Lang>Printer.

The disadvantage with a Printer class per backend is that we have ~2+ files for each backend. So quite a lot.
Although this might not so bad if each backend is moved to its own directory. Just like GlobalISel.

... what happens if you only override a subset of the virtuals? Do you get random bursts of C++ code in your output?

Personally I don't see the danger of unexpected C++ output.
If an introduction document is added how to add a new language Printer (regardless of the design I will definitely add one), it should be clear that all methods for a backend must be overriden.

Using an abstract base class seems like a better choice since then if you forget to override a method (of if one is added to the base class and the subclasses aren't updated) then things should explode in an easy-to-understand/fix way.

The abstract class version was the first implementation I added here. If we agree on backend specific printer classes I would like to go back to this as well.
But with a single Printer for all backends it would force each new language Printer to override methods for other backends as well. Although it only needs one.

(*) We could also have a CommonCPPPrinter (maybe with a better name) that contains things that are the same for all the backends, e.g. emitNamespace and other things like that (although I guess that can be in a later patch).

If each backend gets its own printer class they could simply inherit from such a PrinterCommon or PrinterBase.
So something like : PrinterBase <---- PrinterBackendX <---- PrinterLangY?

Speaking of random bursts of C++ code, how do you handle TableGen backends that allow C++ snippets to be embedded (e.g. for the instruction selection, assuming the docs are up to date)?

I haven't looked at it because for the Capstone use case we only needed the backends listed on top to be honest.
But I'll look into it and report back.

I'd rather extract the printing code for this into a printer too. (The comment to SequenceToOffsetTable.h)

Of cause. But SequenceToOffsetTable is used in a yet not refactored backend. So I left it there for the time being. But I agree that classes like this should be integrated into the Printer.

Would be nice to get a few more opinions about this before reworking the whole patch :)

Absolutely agree :)

nhaehnle requested changes to this revision.Jan 18 2023, 3:21 AM
nhaehnle added a subscriber: nhaehnle.

I appreciate your desire to extract useful information, but I think this is fundamentally the wrong approach to take.

The interface implied by PrintLLVM is at the wrong level of abstraction: will users who want to extract information really want to extract all this information, and if they do, will they want to do it in the same order and along the same hierarchy? What works for C++ might not work for another language and for another context. Another user of the information here is unlikely to need the exact set that is provided by the existing printer. And so on. In essence, the interface is exactly the wrong way around.

The code is already structured into a "database" part (CodeGenTarget) and a "printer" part, and the printer is quite intrinsically tied to C++ structures in many places. Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

The other meta problem here is that you're changing the code to be a lot less maintainable without an easily understandable in-tree motivation. For example: let's say we land this change. Soon after, anybody who looks at the code with these changes will immediately (and correctly) conclude that all those virtual methods aren't necessary and ought to be removed. Therefore, for out-of-tree purposes I would seriously urge you to just write a new printer that re-uses CodeGenTarget.

This would be different if you had an in-tree motivation for the changes.

llvm/utils/TableGen/SequenceToOffsetTable.h
115

I disagree. That's additional complexity that simply isn't needed at this stage.

This revision now requires changes to proceed.Jan 18 2023, 3:21 AM
Rot127 added a comment.EditedJan 19 2023, 12:04 PM

Thank you for the feedback!
Its my first time with LLVM so your critique is very valuable for me.

... will users who want to extract information really want to extract all this information,

Please see the example with the DecoderEmitter below. In which case the answer would be yes.

... and if they do, will they want to do it in the same order and along the same hierarchy? What works for C++ might not work for another language and for another context.

Since the Printer controls the complete output, the hierarchy and order problem could be solved by it. Although it probably ends up in complicated printers.
So yes, non C like languages couldn't be printed in an elegant way.

Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

I agree that this is the way to go for backends which emit mostly unprocessed information from the CodeGenTarget.
But it is complicated for backends like the DecoderEmitter.
This backend generates a state machine for decoding instructions and emits the C++ version of it. But the whole state machine generation logic is mixed together with the code printing.
So it is not possible to build a printer solely on top of the state-machine generation logic.

And maintaining an altered copy of it is maintenance heavy.

Would you say that in such a case this abstraction would be of use? Or should the state-machine generation be move to the CodeGenTarget in you opinion (as target-independent code generation algorithm I guess)?

My underling problem is that the Emitter backends don't just emit syntax. But quite often process (sometimes heavily) the information given by CodeGenTarget and the Records but don't make the result accessible.

The other meta problem here is that you're changing the code to be a lot less maintainable without an easily understandable in-tree motivation. For example: let's say we land this change. Soon after, anybody who looks at the code with these changes will immediately (and correctly) conclude that all those virtual methods aren't necessary and ought to be removed. Therefore, for out-of-tree purposes I would seriously urge you to just write a new printer that re-uses CodeGenTarget.

I could of cause add a Printer for C, but this is probably irrelevant as long as the other points are not solved.

trogdan removed a subscriber: trogdan.

Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

I agree that this is the way to go for backends which emit mostly unprocessed information from the CodeGenTarget.
But it is complicated for backends like the DecoderEmitter.
This backend generates a state machine for decoding instructions and emits the C++ version of it. But the whole state machine generation logic is mixed together with the code printing.
So it is not possible to build a printer solely on top of the state-machine generation logic.

And maintaining an altered copy of it is maintenance heavy.

Would you say that in such a case this abstraction would be of use? Or should the state-machine generation be move to the CodeGenTarget in you opinion (as target-independent code generation algorithm I guess)?

My underling problem is that the Emitter backends don't just emit syntax. But quite often process (sometimes heavily) the information given by CodeGenTarget and the Records but don't make the result accessible.

Yes, that's indeed an issue for what you're trying to do. IMHO it is also an issue for the maintainability of these backends as they are today. So I'd say we should try to separate the state-machine generation code from the printing code. It could be moved to CodeGenTarget, though I think it'd be better to keep things slightly more modular and instead split up the current DecoderEmitter into parts:

  • A Decoder class contains all the state-machine description. Think of it as an "intermediate representation for a decoder/disassembler": it contains the data describing the decoder, but essentially no logic.
  • A separate createDecoder function creates a Decoder given a CodeGenTarget
  • The emitDecoder function writes out the generated C++ for a given Decoder object

Does that make sense to you?

Yes, that's indeed an issue for what you're trying to do. IMHO it is also an issue for the maintainability of these backends as they are today. So I'd say we should try to separate the state-machine generation code from the printing code. It could be moved to CodeGenTarget, though I think it'd be better to keep things slightly more modular and instead split up the current DecoderEmitter into parts:

  • A Decoder class contains all the state-machine description. Think of it as an "intermediate representation for a decoder/disassembler": it contains the data describing the decoder, but essentially no logic.
  • A separate createDecoder function creates a Decoder given a CodeGenTarget
  • The emitDecoder function writes out the generated C++ for a given Decoder object

Does that make sense to you?

This is certainly one way of doing it. This would also fit loosely the current working (the the current DecoderEmitter works like this).

But personally I would start at a more fundamental redesign.

Preparation

  1. Enumerate backends which:
    • processes CodeGen information into more complex structures (e.g. decoder tables, lists of certain operands etc.).
    • mix up code generation and code emitting.
  2. For each backend identify what self-contained InfoBlocks (I lag a better name here) it produces
    • RegisterInfoEmitter: Produces register enums, Subregister information tables, Register classes info tables, algorithms to get register names from the enums etc.
    • AsmWriter: Instruction mnemonic tables, alias mnemonics, Mnemonic selection logic etc.
    • DecoderEmitter: Produces a complete decoder, with all its tables and state-machine traversal algorithms. ...

The design

The new design would consists of small modules which generate only one InfoBlock (a single enum, a single table, code to traverse a certain table).
Additionally we add a controller module which triggers the generation of different blocks.

Complicated InfoBlocks which consists effectively of multiple other InfoBlocks (e.g. the decoder needs instruction enums, the state machine algorithm, state machine tables etc.)
can have dependencies to other InfoBlocks.

For each InfoBlock I imagine a data structure like this:

struct {
  Name; // Unique name/ID of this information
  Data;  // The data, e.g. a vector with strings representing a register name enum or more complex stuff.
  DataType;  // Enumeration, Algorithm/Code, Table, StateMachine ...
  Dependencies;  // List of Names/IDs this InfoBlock depends on.

  void genInfo(CodeGenTarget Target, vector<InfoBlock> Dependencies); // Generates the information and stores it in `Data`.
}

Now, our new controller module would get a list of InfoBlocks. It triggers the generation of them and their dependencies.

If they need to be emitted, an Emitter can get those InfoBlocks and simply print them, without manipulating the data further. It can do this in whatever order, hierarchy and syntax it likes.

*Pros*

*Cons*

  • What counts as self-containing InfoBlock?
  • Danger of very similar modules. E.g. in the long run several InfoBlocks could end up generating *almost* the same information but not quite.
  • It is a ton of work.

With all that said, what do you think?

The very last point ("It is a ton of work") would be a problem for me. I decided because of that to design this patch as it is. Simply because I need to be done at some point.

So just to make sure, will the refactor (as in the diff here) has any future to get upstreamed?
Or will only a more fundamental change (moving logic to CodeGentTarget) be accepted?

0x59616e resigned from this revision.Apr 13 2023, 10:09 AM
Rot127 added a comment.EditedApr 25 2023, 4:48 AM

Since LLVM16 is now released, I wondered if anyone has more comments regarding my proposals to untangle the emitters.
Because if anyone sees this as useful I would be happy to sum up the discussion here and move it into the forum.

@nhaehnle I assume this revision will not be accepted?