This is an archive of the discontinued LLVM Phabricator instance.

[TableGen] Make MCInst decoding more table-driven
Needs Review · Public

Authored by nlguillemot on Aug 16 2019, 4:02 PM.

Details

Reviewers
dsanders
Summary

FixedLenDecoderEmitter used to emit plain C++ code to decode instructions.
The problem with this is that it tends to generate hundreds or thousands of
functions, depending on the backend, which makes compilation of the
generated file slow due to the sheer quantity of functions.

The C++ code that it generated was highly redundant. For example:

  • The same bit patterns were extracted by many different decoder functions.
  • The same decoder methods were called in many places in the file. (A sketch of this style of generated code follows below.)
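To illustrate the kind of redundancy involved, here is a minimal sketch of the old style of generated code. The instruction and register class are hypothetical and the helper names are only modeled on the real generated output; the point is the shape: every such function repeats the same extractions and decoder calls inline.

// Illustrative sketch only; assumes LLVM's MC headers are available.
static MCDisassembler::DecodeStatus
decodeFooInstruction(MCInst &MI, uint32_t Insn, uint64_t Address,
                     const void *Decoder) {
  unsigned Rd = fieldFromInstruction(Insn, 0, 5); // same extraction appears in many decoders
  unsigned Rn = fieldFromInstruction(Insn, 5, 5); // ditto
  if (DecodeGPRRegisterClass(MI, Rd, Address, Decoder) == MCDisassembler::Fail)
    return MCDisassembler::Fail; // the same decoder call pattern repeats, too
  return DecodeGPRRegisterClass(MI, Rn, Address, Decoder);
}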

The generated code was compressed by exploiting these redundancies as follows:

  • Identical sequences of bit extraction operations were unified.
  • Identical sequences of decoder method calls were unified.
  • Identical mixed sequences of bit extractions and decoder method calls were unified.

Each decoder is then defined as a list of IDs representing the sequence of
operations to perform, drawn from the operations listed above (see the sketch below).
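As a minimal sketch of this encoding (the names and table layout here are hypothetical, not the emitter's actual output): each unique bit extraction and each unique decoder method gets an ID, and a decoder reduces to a short sequence of those IDs.

#include <cstdint>

// Hypothetical operation IDs interpreted by the decoder state machine.
enum DecoderOp : uint8_t {
  OP_EXTRACT, // next byte indexes ExtractorTable
  OP_DECODE,  // next byte selects a shared decoder method
  OP_END      // end of this decoder's sequence
};

// One deduplicated bit extraction: take Len bits starting at bit Start.
struct BitExtraction {
  uint8_t Start, Len;
};

static const BitExtraction ExtractorTable[] = {{0, 5}, {5, 5}, {16, 4}};

// One decoder: extract field 0, extract field 1, run decoder method 7, stop.
static const uint8_t DecoderSequence[] = {OP_EXTRACT, 0, OP_EXTRACT, 1,
                                          OP_DECODE,  7, OP_END};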

Every call to decodeToMCInst is implemented as a sequence of bit extractions and
decoder method calls. Given the list of extraction operation IDs and the list of
decoder method IDs used by a given decoder, we only need to know the order in
which to pop and execute either one extractor or one decoder method. By
formalizing this, the original behavior of the FixedLenDecoderEmitter was
reimplemented as a state machine that executes a list of operations, each of
which either does a bit extraction or makes a decoder method call. This makes
decodeToMCInst work in a mostly data-driven way, so the large number of
functions that caused the original compile-time problems is now largely gone.
What remains is a set of constant arrays of integers, which are fast to compile
and often more space-efficient in terms of binary size.
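Continuing the sketch above (again with hypothetical names; the generated code differs in detail), the state machine itself is a small interpreter over such a sequence:

// Placeholder for the shared decoder methods; the real ones add operands
// to the MCInst being built.
static bool callDecoderMethod(uint8_t ID, const uint64_t *Fields,
                              unsigned NumFields) {
  (void)ID; (void)Fields; (void)NumFields;
  return true;
}

static bool runDecoder(const uint8_t *Seq, uint64_t Insn) {
  uint64_t Fields[8];   // extracted fields, in extraction order
  unsigned NumFields = 0;
  while (*Seq != OP_END) {
    switch (*Seq++) {
    case OP_EXTRACT: {  // perform one table-described bit extraction
      const BitExtraction &E = ExtractorTable[*Seq++];
      Fields[NumFields++] = (Insn >> E.Start) & ((1ULL << E.Len) - 1);
      break;
    }
    case OP_DECODE:     // dispatch to one shared decoder method by ID
      if (!callDecoderMethod(*Seq++, Fields, NumFields))
        return false;
      break;
    }
  }
  return true;
}

The constant arrays replace per-decoder code with per-decoder data, which is where the compile-time and binary-size savings come from.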

To test the effect of this patch, the following test was done for every
backend, before and after this patch:

  • Build everything with ninja.
  • Delete the built BACKENDGenDisassemblerTables.inc file (where BACKEND = the name of the backend)
  • Run ninja again with time and measure the time taken to rebuild.

Our experimental results show that this consistently improves the compile time of
GenDisassemblerTables.inc for all backends that use the FixedLenDecoderEmitter:

  • AArch64: 7.6 seconds -> 3.8 seconds
  • AMDGPU: 16.5 seconds -> 5.9 seconds
  • ARM: 11.1 seconds -> 7.3 seconds
  • Hexagon: 5.7 seconds -> 3.5 seconds
  • Mips: 6.4 seconds -> 4.2 seconds

As for generated binary size, there are some wins and some losses, but the net
result is a reduction of 425K. What follows is the before/after size reported by
ls -lh for each lib/libLLVMBACKENDDisassembler.a, where BACKEND is the given backend.

  • AArch64: 283K -> 205K
  • AMDGPU: 524K -> 245K
  • ARM: 430K -> 479K
  • BPF: 18K -> 22K
  • Hexagon: 184K -> 106K
  • Lanai: 17K -> 25K
  • MSP430: 25K -> 27K
  • Mips: 172K -> 147K
  • PowerPC: 80K -> 72K
  • RISCV: 36K -> 42K
  • Sparc: 40K -> 51K
  • SystemZ: 163K -> 91K
  • XCore: 45K -> 80K

Diff Detail

Event Timeline

nlguillemot created this revision. Aug 16 2019, 4:02 PM
Herald added a project: Restricted Project. Aug 16 2019, 4:02 PM
nlguillemot edited the summary of this revision.

Rebased the code and cleaned up the commit message a bit.

Nice. Are you planning to measure and share performance numbers for the decoding itself?

> Nice. Are you planning to measure and share performance numbers for the decoding itself?

I haven't tried but it might be interesting. Do you have a suggestion for how I could benchmark that?

I should also mention, the numbers in the commit message were gathered when this review was originally created, which was a while ago.

> I haven't tried but it might be interesting. Do you have a suggestion for how I could benchmark that?

I'm probably not the best person to review this patch or answer that, but I would say just disassemble a large number of files with llvm-objdump and see how your patch impacts the disassembly time. Some notes:

  • Compile some common programs with typical flags to obtain object files that are representative.
  • Repeat the benchmark for all of the archs, as the results might differ; this implies you need to build cross-platform programs.
  • Try to isolate spurious factors like cold caches, e.g. by disassembling twice and keeping the second value.
  • Send the output of llvm-objdump to /dev/null.

You could also just microbenchmark the decoding itself with custom code. That will better isolate the impact of your changes but it won't elucidate what the impact is on actual programs like llvm-objdump. I expect that your patch will reduce icache pressure and increase dcache pressure, which might have interesting interactions. But maybe none of that matters because the runtime impact of your patch is either not significant or clearly always positive.
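For the custom-code route, a microbenchmark could be as simple as timing repeated getInstruction calls, mirroring the call in llvm-mc's Disassembler.cpp. A rough sketch, assuming an already-constructed MCDisassembler (target registration and setup omitted) and a buffer of raw instruction bytes:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/MC/MCDisassembler/MCDisassembler.h"
#include "llvm/MC/MCInst.h"
#include "llvm/Support/raw_ostream.h"
#include <algorithm>
#include <chrono>

using namespace llvm;

// Returns the average seconds per full decoding pass over Bytes.
static double benchmarkDecode(const MCDisassembler &DisAsm,
                              ArrayRef<uint8_t> Bytes, unsigned Iterations) {
  auto Start = std::chrono::steady_clock::now();
  for (unsigned I = 0; I < Iterations; ++I) {
    uint64_t Index = 0;
    while (Index < Bytes.size()) {
      MCInst Inst;
      uint64_t Size = 0;
      if (DisAsm.getInstruction(Inst, Size, Bytes.slice(Index), Index,
                                nulls()) == MCDisassembler::Fail)
        Size = std::max<uint64_t>(Size, 1); // skip undecodable bytes
      Index += Size;
    }
  }
  std::chrono::duration<double> Elapsed =
      std::chrono::steady_clock::now() - Start;
  return Elapsed.count() / Iterations;
}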

Rebased the patch and added more description to the test check lines.

Here's what I tried to test the performance:

  1. I made the following modification, which deliberately exaggerates the performance cost of disassembly:
diff --git a/llvm/tools/llvm-mc/Disassembler.cpp b/llvm/tools/llvm-mc/Disassembler.cpp
index 16ab99548adf..1a584e4023c1 100644
--- a/llvm/tools/llvm-mc/Disassembler.cpp
+++ b/llvm/tools/llvm-mc/Disassembler.cpp
@@ -46,7 +46,8 @@ static bool PrintInsts(const MCDisassembler &DisAsm,
     MCInst Inst;
 
     MCDisassembler::DecodeStatus S;
-    S = DisAsm.getInstruction(Inst, Size, Data.slice(Index), Index, nulls());
+    for (int i = 0; i < 200; i++)
+      S = DisAsm.getInstruction(Inst, Size, Data.slice(Index), Index, nulls());
     switch (S) {
     case MCDisassembler::Fail:
       SM.PrintMessage(SMLoc::getFromPointer(Bytes.second[Index]),
  2. I ran llvm-lit on the llvm/test/MC/Disassembler folder and measured the time it takes to run the tests. (Note: The modification above causes some tests to fail, but most of them still pass.)

I closed all other apps on my computer and gave the machine time to cool down between runs, so hopefully the measurements are fair and roughly stable.

The results of running llvm-lit on this folder before and after the patch are as follows.

Before:

0m42.260s
0m43.443s
0m43.443s
0m45.445s
0m44.963s
0m43.998s
0m45.456s
0m43.990s
0m44.779s
0m44.253s
Average: 0m44.2031s

After:

0m43.732s
0m43.697s
0m44.078s
0m44.273s
0m44.415s
0m45.005s
0m44.738s
0m44.630s
0m44.466s
0m45.041s
Average: 0m44.4075s

Based on these results it looks like there might be a relatively small (<1%) runtime regression.

I did some tests locally and found the performance could be improved more in two ways:

  1. Use LEB128 less where possible; LEB128 is slow to decode.
  2. Make specialized bit extractor functions for some common arrangements of bit extractor parameters, then dispatch to these specialized extractors by adding more enums to DecoderCodeletID. This improves performance a bit but increases the complexity of the implementation, so I'm not sure it's worth it, since this code is usually not a bottleneck anyway (a hypothetical sketch follows below).
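Here is a hypothetical sketch of idea 2 (DecoderCodeletID is the patch's enum, but the specialized members below are invented for illustration): the most common (Start, Len) pairs get their own opcodes so the interpreter can skip reading generic extractor parameters on the hot path.

enum DecoderCodeletID : uint8_t {
  OP_EXTRACT_GENERIC, // (Start, Len) parameters follow in the operation stream
  OP_EXTRACT_0_5,     // specialized: bits [4:0], no parameters to read
  OP_EXTRACT_5_5      // specialized: bits [9:5], no parameters to read
};

// Generic path: must read Start and Len before extracting.
static uint64_t extractField(uint64_t Insn, uint8_t Start, uint8_t Len) {
  return (Insn >> Start) & ((1ULL << Len) - 1);
}

// The specialized cases in the interpreter's switch reduce to constant-folded
// shifts and masks:
//   case OP_EXTRACT_0_5: Field = Insn & 0x1F;        break;
//   case OP_EXTRACT_5_5: Field = (Insn >> 5) & 0x1F; break;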

Disclaimer: These measurements are from a long time ago; I don't know whether they would still come out the same way now.