FixedLenDecoderEmitter used to emit plain C++ code to decode instructions.
The problem with this is that it tends to generate hundreds or thousands of
functions depending on the backend, which causes the compilation of the
generated file to be slow due to the sheer quantity of functions.
The C++ code that it generated was highly redundant. For example:
- The same bit patterns were extracted by many different decoder functions.
- The same decoder methods were called in many places in the file.
The generated code was compressed by exploiting these redundancies as follows:
- Identical sequences of bit extraction operations were unified.
- Sequences of identical decoder method calls were unified.
- Identical sequences of bit extractions and decoder methods were unified.
Each decoder is then defined as a list of IDs that represent a sequence
of operations to do, based on the operations listed above.
Every call to decodeToMCInst is implemented by a sequence of bit extractions and
decoder method calls. Given the list of extraction operation IDs and the list of
decoder method IDs used by a given decoder, we just need to know the sequence in
which we should pop and execute either one extractor or one decoder method. By
formalizing this, the original behavior of the FixedLenDecoderEmitter was
reimplemented as a state machine that executes a list of operations that either
do a bit extraction or do a decoder method call. This makes decodeToMCInst work
in a mostly data-driven way, so the large number of functions that caused the
original performance problem is now mostly gone. What remains is a
bunch of constant arrays of integers, which are fast to compile and
often more space-efficient in terms of binary size.
To test the effect of this patch, the following test was done for every
backend, before and after this patch:
- Build everything with ninja.
- Delete the built BACKENDGenDisassemblerTables.inc file (where BACKEND = the name of the backend).
- Run ninja again with time and measure the time taken to rebuild.
Our experimental results show that this uniformly improves the compile time of
GenDisassemblerTables.inc for all backends that use the FixedLenDecoderEmitter:
- AArch64: 7.6 seconds -> 3.8 seconds
- AMDGPU: 16.5 seconds -> 5.9 seconds
- ARM: 11.1 seconds -> 7.3 seconds
- Hexagon: 5.7 seconds -> 3.5 seconds
- Mips: 6.4 seconds -> 4.2 seconds
As for the generated code size, there are some wins and some losses, but the
net result is a reduction of 425K. What follows are the before/after sizes
reported by ls -lh for each lib/libLLVMBACKENDDisassembler.a, where BACKEND is
the given backend.
- AArch64: 283K -> 205K
- AMDGPU: 524K -> 245K
- ARM: 430K -> 479K
- BPF: 18K -> 22K
- Hexagon: 184K -> 106K
- Lanai: 17K -> 25K
- MSP430: 25K -> 27K
- Mips: 172K -> 147K
- PowerPC: 80K -> 72K
- RISCV: 36K -> 42K
- Sparc: 40K -> 51K
- SystemZ: 163K -> 91K
- XCore: 45K -> 80K