Page MenuHomePhabricator

[TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output
Needs RevisionPublic

Authored by Rot127 on Nov 18 2022, 12:37 PM.

Details

Summary

This patch introduces an abstraction layer for TableGen backends. It separates the code generation logic and the syntax string output into separated modules.
One module is the backend (RegisterInfo in this patch), the other PrinterLLVM.

The code generation stays in the TableGen backend. The syntax generation however is moved to PrinterLLVM, which writes the result to the output stream.
Whenever the backend has generated a piece of code it requests PrinterLLVM to write the syntax to the output stream.

Before this patch the TableGen backend generated the syntax on its own.
This had the disadvantage that no other languages, then the default C++ syntax, could be output easily.
With the separation of the code generation (backend) and syntax generation (PrinterLLVM) it is easier to output syntax of other languages.
It is also possible write a printer which emits only specific parts of the generated code.

A new printer only implements the methods of the PrinterLLVM it needs and emits its language syntax.

In this patch the RegisterInfo backend was refactored to use PrinterLLVM as output module.
If this gets upstreamed I can provide the refactor for the following modules in the next days:

  • DecoderEmitter
  • SubtargetInfo
  • InstructionInfo
  • AsmMatcher

For more background info and where this idea is coming from you can take a look at:

_Feedback implementation_
https://reviews.llvm.org/D136808

_Discussion in discourse.llvm.org_
https://discourse.llvm.org/t/comments-needed-for-refactoring-decoderemitter-tablegen-backend/65738

Diff Detail

Event Timeline

Rot127 created this revision.Nov 18 2022, 12:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 18 2022, 12:37 PM
Rot127 requested review of this revision.Nov 18 2022, 12:37 PM
Rot127 added a comment.EditedNov 18 2022, 1:48 PM

To test the output after the refactor I used the following simple scripts:

Generate code files of all architectures.
This needs to be run twice. Once with a build on the main branch (remove the --printerLang=C++ option and the -new in the output filename).

And once with a build with this patch (add the -new in the output filename and --printerLang=C++ option` again)

#!/bin/bash
LLVM_PATH="<path>"
for ARCH in AArch64 AMDGPU ARC ARM AVR BPF CSKY DirectX Hexagon Lanai LoongArch M68k Mips MSP430 NVPTX PPC RISCV Sparc SPIRV SystemZ VE XCore; do
  echo $ARCH
  if [ $ARCH = "PPC" ]; then
    ARCH_D="PowerPC"
  else
    ARCH_D=$ARCH
  fi
  $LLVM_PATH/build/bin/llvm-tblgen --gen-register-info --printerLang=C++ -I $LLVM_PATH/llvm/include/ -I $LLVM_PATH/llvm/lib/Target/$ARCH_D/ -o $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH-new.inc $LLVM_PATH/llvm/lib/Target/$ARCH_D/$ARCH.td
done

Then diffing <Arch>.inc and <Arch>-new.inc with

#!/bin/bash
LLVM_PATH="<path>"
for ARCH in AArch64 AMDGPU ARC ARM AVR BPF CSKY DirectX Hexagon Lanai LoongArch M68k Mips MSP430 NVPTX PPC RISCV Sparc SPIRV SystemZ VE XCore; do
  echo $ARCH
  diff $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH.inc $LLVM_PATH/tblgen-comparisons/reginfo/$ARCH-new.inc
  echo $ARCH DONE
done
Rot127 retitled this revision from {TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output to [TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output.Nov 22 2022, 8:16 AM
Rot127 added reviewers: resistor, grosbach, craig.topper.

I like your idea, but I'm not a big fan of the code structure. It seems that we have to add a plethora of virtual functions for every backend in a single PrinterInterface class ? Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Rot127 added a comment.EditedNov 24 2022, 7:02 AM

Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Indeed it does. How about removing the PrinterInterface and make the PrinterLLVM the base class for all extensions (and mark its methods as virtual). Other Printers inherit from PrinterLLVM and override the methods needed?
This wouldn't be a big change, but also keep the printing logic in one place and not spread around the backends.

Rot127 updated this revision to Diff 485707.Dec 30 2022, 9:48 AM
Rot127 edited the summary of this revision. (Show Details)

Removes the PrinterInterface. New language printer should simply inherit from PrinterLLVM.

arsenm added a subscriber: arsenm.Thu, Jan 12, 9:13 AM
arsenm added inline comments.
llvm/utils/TableGen/Printer.h
74

Single quotes

100

Should use StringRef for all these const string parameters

llvm/utils/TableGen/PrinterLLVM.cpp
1419

Don't need a separate OS << on each line, can just break the one literal across multiple lines

rovka added a subscriber: rovka.Fri, Jan 13, 2:07 AM

Wouldn't this break the 'open-close principle' and 'Interface segregation principle' ?

Indeed it does. How about removing the PrinterInterface and make the PrinterLLVM the base class for all extensions (and mark its methods as virtual). Other Printers inherit from PrinterLLVM and override the methods needed?
This wouldn't be a big change, but also keep the printing logic in one place and not spread around the backends.

I too have concerns about the design here. I'd rather see smaller, more specific classes, e.g. a RegEmitterPrinter abstract base class which can then be inherited by a RegEmitterCPPPrinter(*) and other similar RegEmitter<Lang>Printer. Having the printer for a non-C++ language inherit from the C++ printer seems suspicious - what happens if you only override a subset of the virtuals? Do you get random bursts of C++ code in your output? Using an abstract base class seems like a better choice since then if you forget to override a method (of if one is added to the base class and the subclasses aren't updated) then things should explode in an easy-to-understand/fix way.

Speaking of random bursts of C++ code, how do you handle TableGen backends that allow C++ snippets to be embedded (e.g. for the instruction selection, assuming the docs are up to date)?

Would be nice to get a few more opinions about this before reworking the whole patch :)

(*) We could also have a CommonCPPPrinter (maybe with a better name) that contains things that are the same for all the backends, e.g. emitNamespace and other things like that (although I guess that can be in a later patch).

llvm/utils/TableGen/Printer.h
82

Nit: Would be clearer to just have emitBeginNamespace and emitEndNamespace. Ditto below for emitIncludeToggle.

llvm/utils/TableGen/RegisterInfoEmitter.cpp
787

Nit: This looks unrelated, you should commit it separately.

llvm/utils/TableGen/SequenceToOffsetTable.h
115

I'd rather extract the printing code for this into a printer too.

Rot127 added a comment.EditedSat, Jan 14, 2:15 PM

Thanks for the feedback so far!

I'd rather see smaller, more specific classes, e.g. a RegEmitterPrinter abstract base class which can then be inherited by a RegEmitterCPPPrinter(*) and other similar RegEmitter<Lang>Printer.

The disadvantage with a Printer class per backend is that we have ~2+ files for each backend. So quite a lot.
Although this might not so bad if each backend is moved to its own directory. Just like GlobalISel.

... what happens if you only override a subset of the virtuals? Do you get random bursts of C++ code in your output?

Personally I don't see the danger of unexpected C++ output.
If an introduction document is added how to add a new language Printer (regardless of the design I will definitely add one), it should be clear that all methods for a backend must be overriden.

Using an abstract base class seems like a better choice since then if you forget to override a method (of if one is added to the base class and the subclasses aren't updated) then things should explode in an easy-to-understand/fix way.

The abstract class version was the first implementation I added here. If we agree on backend specific printer classes I would like to go back to this as well.
But with a single Printer for all backends it would force each new language Printer to override methods for other backends as well. Although it only needs one.

(*) We could also have a CommonCPPPrinter (maybe with a better name) that contains things that are the same for all the backends, e.g. emitNamespace and other things like that (although I guess that can be in a later patch).

If each backend gets its own printer class they could simply inherit from such a PrinterCommon or PrinterBase.
So something like : PrinterBase <---- PrinterBackendX <---- PrinterLangY?

Speaking of random bursts of C++ code, how do you handle TableGen backends that allow C++ snippets to be embedded (e.g. for the instruction selection, assuming the docs are up to date)?

I haven't looked at it because for the Capstone use case we only needed the backends listed on top to be honest.
But I'll look into it and report back.

I'd rather extract the printing code for this into a printer too. (The comment to SequenceToOffsetTable.h)

Of cause. But SequenceToOffsetTable is used in a yet not refactored backend. So I left it there for the time being. But I agree that classes like this should be integrated into the Printer.

Would be nice to get a few more opinions about this before reworking the whole patch :)

Absolutely agree :)

nhaehnle requested changes to this revision.Wed, Jan 18, 3:21 AM
nhaehnle added a subscriber: nhaehnle.

I appreciate your desire to extract useful information, but I think this is fundamentally the wrong approach to take.

The interface implied by PrintLLVM is at the wrong level of abstraction: will users who want to extract information really want to extract all this information, and if they do, will they want to do it in the same order and along the same hierarchy? What works for C++ might not work for another language and for another context. Another user of the information here is unlikely to need the exact set that is provided by the existing printer. And so on. In essence, the interface is exactly the wrong way around.

The code is already structured into a "database" part (CodeGenTarget) and a "printer" part, and the printer is quite intrinsically tied to C++ structures in many places. Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

The other meta problem here is that you're changing the code to be a lot less maintainable without an easily understandable in-tree motivation. For example: let's say we land this change. Soon after, anybody who looks at the code with these changes will immediately (and correctly) conclude that all those virtual methods aren't necessary and ought to be removed. Therefore, for out-of-tree purposes I would seriously urge you to just write a new printer that re-uses CodeGenTarget.

This would be different if you had an in-tree motivation for the changes.

llvm/utils/TableGen/SequenceToOffsetTable.h
115

I disagree. That's additional complexity that simply isn't needed at this stage.

This revision now requires changes to proceed.Wed, Jan 18, 3:21 AM
Rot127 added a comment.EditedThu, Jan 19, 12:04 PM

Thank you for the feedback!
Its my first time with LLVM so your critique is very valuable for me.

... will users who want to extract information really want to extract all this information,

Please see the example with the DecoderEmitter below. In which case the answer would be yes.

... and if they do, will they want to do it in the same order and along the same hierarchy? What works for C++ might not work for another language and for another context.

Since the Printer controls the complete output, the hierarchy and order problem could be solved by it. Although it probably ends up in complicated printers.
So yes, non C like languages couldn't be printed in an elegant way.

Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

I agree that this is the way to go for backends which emit mostly unprocessed information from the CodeGenTarget.
But it is complicated for backends like the DecoderEmitter.
This backend generates a state machine for decoding instructions and emits the C++ version of it. But the whole state machine generation logic is mixed together with the code printing.
So it is not possible to build a printer solely on top of the state-machine generation logic.

And maintaining an altered copy of it is maintenance heavy.

Would you say that in such a case this abstraction would be of use? Or should the state-machine generation be move to the CodeGenTarget in you opinion (as target-independent code generation algorithm I guess)?

My underling problem is that the Emitter backends don't just emit syntax. But quite often process (sometimes heavily) the information given by CodeGenTarget and the Records but don't make the result accessible.

The other meta problem here is that you're changing the code to be a lot less maintainable without an easily understandable in-tree motivation. For example: let's say we land this change. Soon after, anybody who looks at the code with these changes will immediately (and correctly) conclude that all those virtual methods aren't necessary and ought to be removed. Therefore, for out-of-tree purposes I would seriously urge you to just write a new printer that re-uses CodeGenTarget.

I could of cause add a Printer for C, but this is probably irrelevant as long as the other points are not solved.

trogdan removed a subscriber: trogdan.

Ideally, you'd just use CodeGenTarget from a new printer, and potentially augment that class with additional features you may need.

I agree that this is the way to go for backends which emit mostly unprocessed information from the CodeGenTarget.
But it is complicated for backends like the DecoderEmitter.
This backend generates a state machine for decoding instructions and emits the C++ version of it. But the whole state machine generation logic is mixed together with the code printing.
So it is not possible to build a printer solely on top of the state-machine generation logic.

And maintaining an altered copy of it is maintenance heavy.

Would you say that in such a case this abstraction would be of use? Or should the state-machine generation be move to the CodeGenTarget in you opinion (as target-independent code generation algorithm I guess)?

My underling problem is that the Emitter backends don't just emit syntax. But quite often process (sometimes heavily) the information given by CodeGenTarget and the Records but don't make the result accessible.

Yes, that's indeed an issue for what you're trying to do. IMHO it is also an issue for the maintainability of these backends as they are today. So I'd say we should try to separate the state-machine generation code from the printing code. It could be moved to CodeGenTarget, though I think it'd be better to keep things slightly more modular and instead split up the current DecoderEmitter into parts:

  • A Decoder class contains all the state-machine description. Think of it as an "intermediate representation for a decoder/disassembler": it contains the data describing the decoder, but essentially no logic.
  • A separate createDecoder function creates a Decoder given a CodeGenTarget
  • The emitDecoder function writes out the generated C++ for a given Decoder object

Does that make sense to you?