This is an archive of the discontinued LLVM Phabricator instance.

[TableGen][CodeEmitterGen] Add support for querying operand bit offsets
ClosedPublic

Authored by iii on Jul 14 2023, 12:23 PM.

Details

Summary

In order to generate relocations or to apply fixups after the layout
has been computed, the targets need to know the offsets of the
respective operands. There are indirect ways to figure them out in some
cases, for example, on SystemZ, the first memory operand is always at
offset 2, and the second one is always at offset 4. But there are no
such tricks for the immediate operands on SystemZ, so one has to refer
to individual instruction encodings.

This information, however, is available to TableGen. Generate
the getOperandBitOffset() method to access it, and use it to simplify
getting memory operand offsets on SystemZ. This also paves the way for
implementing symbolic immediates on this platform.

For the multi-lit operands, getOperandBitOffset() returns the offset of
the first lit.

An alternative way to obtain offsets would be to pass them to the
encoder methods, but this would require reworking all targets. Also,
VarLenCodeEmitter already does this, but adopting it requires
reworking the respective targets without other significant benefits.

Diff Detail

Event Timeline

iii created this revision.Jul 14 2023, 12:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2023, 12:23 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
iii requested review of this revision.Jul 14 2023, 12:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2023, 12:23 PM
craig.topper added inline comments.Jul 14 2023, 7:37 PM
llvm/utils/TableGen/CodeEmitterGen.cpp
226

If I'm reading this right, this is the offset from the MSB of the encoding to the MSB of the field? Which works for SystemZ because the encoding bytes are emitted in big endian, but would not be useful for a little endian target?

iii updated this revision to Diff 540906.Jul 17 2023, 1:36 AM
  • Add little-endian support.
  • Print operand names.
  • Fix a typo in the #endif comment.
iii added inline comments.Jul 17 2023, 1:38 AM
llvm/utils/TableGen/CodeEmitterGen.cpp
226

You are right, I thought that reverseBitsForLittleEndianEncoding() would handle that, but it doesn't.
I've updated the patch and verified the update with PPC, which sets isLittleEndianEncoding.
For BCCTR (https://www.ibm.com/docs/en/aix/7.2?topic=set-bcctr-bcc-branch-conditional-count-register-instruction) we get:

switch (OpNum) {
case 0:
  // op: BI
  return 20;
}

which matches the drawing and the encoding in practice:

       0: 20 1c 22 4c   bcctr 1, 2, 3

                               bit 20
                                 |
                                 v
BitV  00100000    00011100    00100010    01001100
Bit#  76543210    54321098    32109876    10987654
      +++++++-    ...--+++    +++-----    ------++
Field   528  LK      BH       BO   BI       19  BO
iii marked an inline comment as done.Jul 17 2023, 8:11 AM

I'm not sure when targets are supposed to set isLittleEndianEncoding(). RISC-V encoding is definitely little endian, but we don't set isLittleEndianEncoding().

iii added a comment.Jul 17 2023, 8:30 AM

Hmm, yes, this indeed does not work for RISC-V. E.g., for LUI I get:

switch (OpNum) {
case 1:
  // op: imm20
  return 0;
case 0:
  // op: rd
  return 20;
}

which is in the wrong order:

0000000000000000 <.text>:
   0:	000c72b7          	lui	t0,0xc7

        b7 72 0c 00  # in memory

I'll need to find (or introduce) a different criterion.

iii added a comment.Jul 17 2023, 8:56 AM

Right now there doesn't seem to be a generic way to get the instruction endianness.
The best example is ARM:

ARMDisassembler(const MCSubtargetInfo &STI, MCContext &Ctx,
                const MCInstrInfo *MCII)
    : MCDisassembler(STI, Ctx), MCII(MCII) {
  InstructionEndianness = STI.hasFeature(ARM::ModeBigEndianInstructions)
                              ? llvm::support::big
                              : llvm::support::little;

where it's separate from data endianness and is not available to TableGen.
Even though it's used only for disassembly, it demonstrates the level of flexibility that may be requred.

So I wonder if it would be acceptable to pass endianness as a parameter to getOperandBitOffset() like this?

--- a/llvm/lib/Target/SystemZ/MCTargetDesc/SystemZMCCodeEmitter.cpp
+++ b/llvm/lib/Target/SystemZ/MCTargetDesc/SystemZMCCodeEmitter.cpp
@@ -55,6 +55,7 @@ private:
                                  SmallVectorImpl<MCFixup> &Fixups,
                                  const MCSubtargetInfo &STI) const;
   uint32_t getOperandBitOffset(const MCInst &MI, unsigned OpNum,
+                               bool IsLittleEndian,
                                const MCSubtargetInfo &STI) const;
 
   // Called by the TableGen code to get the binary encoding of operand
@@ -174,7 +175,7 @@ uint64_t SystemZMCCodeEmitter::getDispOpValue(const MCInst &MI, unsigned OpNum,
   if (MO.isImm())
     return static_cast<uint64_t>(MO.getImm());
   if (MO.isExpr()) {
-    uint32_t BitOffset = getOperandBitOffset(MI, OpNum, STI);
+    uint32_t BitOffset = getOperandBitOffset(MI, OpNum, false, STI);
     Fixups.push_back(MCFixup::create(BitOffset >> 3, MO.getExpr(),
                                      (MCFixupKind)Kind, MI.getLoc()));
     assert(Fixups.size() <= 2 && "More than two memory operands in MI?");
diff --git a/llvm/utils/TableGen/CodeEmitterGen.cpp b/llvm/utils/TableGen/CodeEmitterGen.cpp
index dcac1efb7132..ddca6f0a2a68 100644
--- a/llvm/utils/TableGen/CodeEmitterGen.cpp
+++ b/llvm/utils/TableGen/CodeEmitterGen.cpp
@@ -211,12 +211,11 @@ bool CodeEmitterGen::addCodeToMergeInOperand(Record *R, BitsInit *BI,
     unsigned hiBit = loBit + N;
     unsigned loInstBit = beginInstBit - N + 1;
     if (!BitOffsetCaseEmitted) {
-      unsigned bitOffset = Target.isLittleEndianEncoding()
-                               ? beginInstBit
-                               : BI->getNumBits() - beginInstBit - 1;
       BitOffsetCase += "      case " + utostr(OpIdx) + ":\n";
       BitOffsetCase += "        // op: " + VarName + "\n";
-      BitOffsetCase += "        return " + utostr(bitOffset) + ";\n";
+      BitOffsetCase += "        return IsLittleEndian ? " +
+                       utostr(beginInstBit) + " : " +
+                       utostr(BI->getNumBits() - beginInstBit - 1) + ";\n";
       BitOffsetCaseEmitted = true;
     }
     if (UseAPInt) {
@@ -537,6 +536,7 @@ void CodeEmitterGen::run(raw_ostream &o) {
       << "uint32_t " << Target.getName()
       << "MCCodeEmitter::getOperandBitOffset(const MCInst &MI,\n"
       << "    unsigned OpNum,\n"
+      << "    bool IsLittleEndian,\n"
       << "    const MCSubtargetInfo &STI) const {\n"
       << "  switch (MI.getOpcode()) {\n";
     emitCaseMap(o, BitOffsetCaseMap);
iii updated this revision to Diff 541371.Jul 18 2023, 12:52 AM
  • Pass endianness explicitly.
iii updated this revision to Diff 541428.Jul 18 2023, 3:27 AM
  • Fix the handling of multi-lit instructions on little-endian. The code used to return the start of the last lit, which is not useful.

Example output with this change:

case RISCV::AUIPC:
case RISCV::JAL:
case RISCV::LUI: {
  switch (OpNum) {
  case 1:
    // op: imm20
    return IsLittleEndian ? 12 : 0;
  case 0:
    // op: rd
    return IsLittleEndian ? 7 : 20;
  }
  break;
}
case SystemZ::CFI:
case SystemZ::CGFI:
case SystemZ::CIH:
case SystemZ::CLFI:
case SystemZ::CLGFI:
case SystemZ::CLIH:
case SystemZ::IIHF:
case SystemZ::IILF:
case SystemZ::LGFI:
case SystemZ::LLIHF:
case SystemZ::LLILF: {
  switch (OpNum) {
  case 0:
    // op: R1
    return IsLittleEndian ? 36 : 8;
  case 1:
    // op: I2
    return IsLittleEndian ? 0 : 16;
  }
  break;
}
case SystemZ::LLILF: {
  switch (OpNum) {
  case 0:
    // op: R1
    return IsLittleEndian ? 36 : 8;
  case 1:
    // op: I2
    return IsLittleEndian ? 0 : 16;
  }
  break;
}

I guess this numbers are still a bit confusing to me. What exactly does IsLittleEndian specify here? TableGen doesn't really inherently know about the in-memory representation, it defines the instruction as a single integer. E.g. looking at how this instruction format is defined:

let Inst{47-40} = op{11-4};
let Inst{39-36} = R1;
let Inst{35-32} = op{3-0};
let Inst{31-0}  = I2;

we see it's represented as a single 48-bit integer. Its encoding into bytes in memory only happens in the platform-specific encodeInstruction routine here:

void SystemZMCCodeEmitter::encodeInstruction(const MCInst &MI,
                                             SmallVectorImpl<char> &CB,
                                             SmallVectorImpl<MCFixup> &Fixups,
                                             const MCSubtargetInfo &STI) const {
  MemOpsEmitted = 0;
  uint64_t Bits = getBinaryCodeForInstr(MI, Fixups, STI);
  unsigned Size = MCII.get(MI.getOpcode()).getSize();
  // Big-endian insertion of Size bytes.
  unsigned ShiftValue = (Size * 8) - 8;
  for (unsigned I = 0; I != Size; ++I) {
    CB.push_back(uint8_t(Bits >> ShiftValue));
    ShiftValue -= 8;
  }
}

So I would have expected the getOperandBitOffset call to return one of the values explicitly called out in the instruction format definition, i.e. 39 or 36 for R1 in the above example. (I guess it doesn't matter which one, as the user will know the field length, so they are able to compute the other one.) Using that value, plus the total size of the current instruction (which can be gotten via MCII.get(MI.getOpcode()).getSize()), the caller should able to compute whatever they need.

iii updated this revision to Diff 541592.Jul 18 2023, 9:12 AM
  • Move the endianness handling to the target C++ code.

The SystemZ parts LGTM. The TableGen changes also look reasonable to me, but I'd also like to wait for Craig's input on whether his concerns are resolved with this version.

It would be good to have a comment spelling out the explicit semantics of the new getOperandBitOffset routine that targets can rely on. Maybe the top of CodeEmitterGen.cpp would be a good place for this.

iii updated this revision to Diff 542235.Jul 19 2023, 4:17 PM
  • Document the new function.
craig.topper added inline comments.Jul 19 2023, 4:23 PM
llvm/utils/TableGen/CodeEmitterGen.cpp
11

MachineInstr should be MCInst here and in the new paragraph.

iii updated this revision to Diff 542326.Jul 20 2023, 12:01 AM
  • s/MachineInstr/MCInst/g
This revision is now accepted and ready to land.Jul 20 2023, 12:54 AM
This revision was landed with ongoing or failed builds.Jul 20 2023, 1:11 AM
This revision was automatically updated to reflect the committed changes.

@iii I'm seeing build warnings (which we treat as errors in most builds) due to this:

E:\llvm\ninja\lib\Target\SystemZ\SystemZGenMCCodeEmitter.inc(15414): warning C4060: switch statement contains no 'case' or 'default' labels

which came from:

    case SystemZ::CSCH:
    case SystemZ::HSCH:
    case SystemZ::IPK:
    case SystemZ::NNPA:
    case SystemZ::NOP_bare:
    case SystemZ::PALB:
    case SystemZ::PCC:
    case SystemZ::PCKMO:
    case SystemZ::PFPO:
    case SystemZ::PR:
    case SystemZ::PTFF:
    case SystemZ::PTLB:
    case SystemZ::RCHP:
    case SystemZ::RSCH:
    case SystemZ::SAL:
    case SystemZ::SAM24:
    case SystemZ::SAM31:
    case SystemZ::SAM64:
    case SystemZ::SCHM:
    case SystemZ::SCKPF:
    case SystemZ::TAM:
    case SystemZ::TEND:
    case SystemZ::TRAP2:
    case SystemZ::UPT:
    case SystemZ::XSCH: {
      switch (OpNum) {
      }
      break;
    }
  }
  std::string msg;
  raw_string_ostream Msg(msg);
  Msg << "Not supported instr[opcode]: " << MI << "[" << OpNum << "]";
  report_fatal_error(Msg.str().c_str());
}

#endif // GET_OPERAND_BIT_OFFSET

Please can you update this so that empty case map entries are not emitted?

iii added a comment.Jul 20 2023, 2:00 AM

Thanks for letting me know, I'll fix this.