This is an archive of the discontinued LLVM Phabricator instance.

[ifs] Allow llvm-ifs to generate text stub from elf stub
ClosedPublic

Authored by haowei on Jan 6 2022, 2:36 PM.

Details

Summary

ELF stubs generated from llvm-ifs lacks program headers, which prevents llvm-ifs from parsing them properly as program headers are required by llvm's own ELF libraries. This patch adds a few workaround bypass this limitation.

llvm::Object::ElfFile::toMappedAddr() function requires valid program header, which was used by llvm-ifs's elf parser to determine the pointers to the strtab and dynsym sections. Instead of adding program header sections to the ELF stub output, this change reads .dynsym section from section headers and use sh_link and sh_offset to calculate the pointers to the required sections. If these step fails, it then fallbacks to the original method of using program headers. If both methods failed, An error will be thrown.

Diff Detail

Event Timeline

haowei created this revision.Jan 6 2022, 2:36 PM
haowei requested review of this revision.Jan 6 2022, 2:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2022, 2:36 PM
mcgrathr added inline comments.Jan 6 2022, 5:57 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
593

This value cannot be safely used at all. The DT_STRTAB address is not something that can be presumed to be an offset into the file, even if that's often true.

The behavior I'd expect here is to first look for section headers. If there are section headers at all, then purely look for the SHT_DYNSYM section and follow its sh_link to its SHT_STRTAB section and use its sh_offset and sh_size. If there are section headers at all, but no SHT_DYNSYM section or its sh_link doesn't point to an SHT_STRTAB section, then you don't need to handle this input file. It's invalid or empty.

Only if there are no section headers at all should you fall back to decoding based on phdrs. This should be the same for finding the SHT_DYNAMIC section in the first place, only falling back to PT_DYNAMIC if there are no section headers at all.

Also note that for DT_NEEDED et al, strings pointed to by SHT_DYNAMIC entries, technically it's the sh_link of the SHT_DYNAMIC that tells you what string table section to use. In practice, it's always the same one as the one that SHT_DYNSYM's sh_link points to, but by the spec they can be separate when dealing with sections.

632

Same issues here.

haowei updated this revision to Diff 398260.Jan 7 2022, 3:39 PM
haowei edited the summary of this revision. (Show Details)
haowei marked an inline comment as done.Jan 7 2022, 3:42 PM
haowei added inline comments.
llvm/lib/InterfaceStub/ELFObjHandler.cpp
593

I have corrected the pointers to the .dynstr and .dynsym according to your suggestions.

I will submit another patch to correct the DT_NEEDED as it is out of the scope of this patch. Thanks for pointing that out.

LGTM modulo the fallback behaviors as mentioned below. But I'll leave the approval to those who have done the review for this specific codebase more generally, since I'm not sure on all the points of LLVM style or what library features are available or best to use.

mcgrathr added inline comments.Jan 7 2022, 4:39 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
529

I would actually fall back here only if there are no section headers at all. If there are any section headers at all but no SHT_DYNSYM section, then don't look further. i.e. return failure after the for loop above inside the if.

535

This is inconsistent with the Shdrs case above wrt whether or not to use bool coercion for a zero check.
I personally like avoiding the implicit coercion and spelling it out like you did here.
But I think it's most important to be consistent in style for all the integer cases (using implicit bool coercion for pointer nullptr checks is canonical style in C++, but for integers it's less clear).

570

Same here wrt the fallback logic.

573

What's the purpose of fetching and checking PHdrs here when you just use toMappedAddr (which I presume only works if the phdrs were present to tell it what to do)?

haowei marked an inline comment as done.Jan 7 2022, 5:16 PM
haowei added inline comments.
llvm/lib/InterfaceStub/ELFObjHandler.cpp
529

If it fail when .dynsym is absent, test https://github.com/llvm/llvm-project/blob/ebd8eee62a431a6744c3f187fcda58e5dea08499/llvm/test/tools/llvm-ifs/binary-read-add-soname.test will fail (there are 7 similar tests) as there is no .dynsym at all, .dynstr (though empty) is pointed by .dynamic . I believe either fallback strategy won't cause any issues in the field.

If an ELF has no symbols (which is unlikely), should it have no .dynsym section or should it have an empty .dynsym?

535

Thanks for pointing this out. I will delete this check since toMappedAddr will fail with an empty program headers. I can re-add the check if you think it is necessary to add a more detailed error message when the program header is empty.

573

It was a check before toMappedAddr since this function will fail if program headers are empty. I will remove the check.

haowei updated this revision to Diff 398284.Jan 7 2022, 5:47 PM
haowei updated this revision to Diff 398290.Jan 7 2022, 6:24 PM
haowei marked an inline comment as done.
haowei added inline comments.
llvm/lib/InterfaceStub/ELFObjHandler.cpp
529

I uploaded a patch using your suggested fallback logic and modified affected tests (and deleted one as it not achievable).

mcgrathr added inline comments.Jan 7 2022, 9:19 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
529

No SHT_DYNSYM section could ever be empty, because index 0 is reserved and must contain a stub entry.

I've never seen a linker that produces a .dynamic at all without having a .dynsym. clang -nostdlib -shared -o /tmp/foo.so -xassembler /dev/null with a toolchain that uses current LLD by default produces a .dynsym that has nothing but the null entry. With some current or past GNU linkers, that degenerate case would still produce either some local STT_SECTION symbols in .dynsym (which really never had a reason to be there), or always export some symbols like _end or _edata in .dynsym so it was never empty of symbols even when it probably should have been.

So I think basically, if there are section headers at all and this is a real-world ET_DYN file of any plausible normal sort of provenance that development tools were ever really expected to handle, then finding no SHT_DYNSYM at all is more likely to indicate a corrupted file or tool bug than a degenerate ELF file that's actually harmless. Likewise, if you're looking at PT_DYNAMIC data via phdrs instead, then once you've found the dynamic section then if it doesn't include all of DT_{SYMTAB,SYMENT,STRTAB,STRSZ} then it's more likely to be a corrupt image than a valid degenerate ELF file with no dynamic symbol table. Even a static PIE that has no dynamic relocs that require referring to a symbol table at all (only simple fixups) will in practice still have a symbol table with the null entry at index 0 and an empty string table but with an address and headers as if it had a length other than zero.

535

I think whatever form of "the memory access didn't work" error it produces is fine without being any more specific about why it failed for this case than it is for any other case.

I've not really dug into this patch at this point, but one high-level thing I wanted to highlight: it looks like you're doing a lot of reading of the object file to interpret the contents, in ways that tools like llvm-readelf and llvm-objdump already do. Do you actually need this much additional code as a result? It might need some refactoring, but it seems like we're otherwise reinventing the wheel...!

llvm/lib/InterfaceStub/ELFObjHandler.cpp
229

This feels like it should be its own independent fix, with corresponding testing etc.

529

FWIW, we have a proprietary linker that sticks dynamic data like the dynsym and dynstr inside a combined section, pointed to by dynamic tags contained in a (normal) dynamic section. As such, in our case, we would have a .dynamic, but no .dynstr or .dynsym (in the section header table). We don't use this tool however, and I know our case is unusual, so don't feel a need to handle it necessarily - just highlighting that it is a possible case (just not with, to my knowledge, mainstream linkers).

haowei updated this revision to Diff 399099.Jan 11 2022, 3:28 PM
haowei added inline comments.Jan 11 2022, 3:30 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
229

I splitted it into https://reviews.llvm.org/D117058. Please take a look.

529

I am slightly lean a bit more with the fallback logic in diff3 (if .dynsym is not found, fallback to use program header and pointers from .dynamic) instead of exit early when .dynsym is not found. I bet there will be other elf tools in the wild that may generate ELF files with not common sections arrangements and it may help to avoid hassles in the future. What do you think?

I've not really dug into this patch at this point, but one high-level thing I wanted to highlight: it looks like you're doing a lot of reading of the object file to interpret the contents, in ways that tools like llvm-readelf and llvm-objdump already do. Do you actually need this much additional code as a result? It might need some refactoring, but it seems like we're otherwise reinventing the wheel...!

I agree it feels like re-inventing the wheel. I spent sometime reading llvm-objdump and llvm-readobj 's code and actually they both have their own dedicated ELF dumper implementation and they are not interchangeable. While theoretically it is possible to do refractor so all 3 tools can use the same ELF parsing implementation, but it looks like a large amount of work. In both llvm-objdump and llvm-readobj's case, the ELF dumper implementation did not provide any interface to access sections or symbol tables, instead, both tool just print the section data into IO streams, making them difficult to reuse. In llvm-ifs's case, most of ELF parsing is for extracting symbol tables from the ELF file, other sections like .rela are completely ignored. The usages and requirements of all 3 tools are not entirely overlapped.

mcgrathr added inline comments.Jan 11 2022, 5:23 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
529

SGTM. The suggestion to bail early was based on the idea that it might be a sign of corrupt input rather than anything else in practice. Even just knowing that there exists any linker that produces output that doesn't match the assumptions I stated is reason enough to be more accepting of any input we can find a way to work with.

haowei updated this revision to Diff 399196.Jan 11 2022, 9:26 PM
haowei edited the summary of this revision. (Show Details)

Change the fallback logics to Diff3. Also adds parents revision to make presubmit check happy.

jhenderson added inline comments.Jan 12 2022, 2:01 AM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
516–518

LLVM style says no braces for simple if/loop statements. Also applies below.

You might want some blank lines to break up the stages of this function, especially after deleting the unnecessary braces.

522
523

The fact that the getDynSymtabSize function exists makes me think we should have a function getDynSymtabSection function in the ElfFile interface. Possibly this getDynStr function should also be moved there.

536

I'm not sure if the doxygen style comments are really needed on static functions, but I'm not necessarily opposed to them.

540

Similar note to above, maybe this should be part of the ElfFile interface?

546

Same as above re. braces, here and below.

549

Noting that most of this function is almost identical to above, so probably you want to have a function that finds a dynsym section and returns it, so that these two functions can work on them as appropriate.

haowei updated this revision to Diff 399481.Jan 12 2022, 3:00 PM
haowei marked 5 inline comments as done.Jan 12 2022, 3:04 PM
haowei added inline comments.
llvm/lib/InterfaceStub/ELFObjHandler.cpp
523

I tried this idea in https://gist.github.com/zeroomega/3d19c04cdc489d99f5ff0d0974b61772 but I am kind of regretting it. It makes the fallback logic a bit messy, partially due to how llvm::Error is constructed. Also I still need to fetch ElfFile.sections() again when extracting the StringRef so it didn't actually make this function more tidy.

540

See the explanation above.

546
jhenderson added inline comments.Jan 13 2022, 1:26 AM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
522

You missed the additional "are".

549

How about having a function that takes a lambda "OnFound", to share the majority of the logic between these two functions (that may or may not then be sensible to add to the ElfFile interface). Approximate untested implementation and usage:

template <class ELFT>
Expected<const uint8_t*> getSectionFromDynsym(const ElfFile<ELFT> &ElfFile, ELFT::Elf_Addr DynAddr, StringRef Name, function_ref<Expected<const uint8_t *>(const ELFT::Shdr &, const ELFT::ShdrRange &) OnFound) {
    using Elf_Shdr_Range = typename ELFT::ShdrRange;
  using Elf_Shdr = typename ELFT::Shdr;

  Expected<Elf_Shdr_Range> Shdrs = ElfFile.sections();
  if (!Shdrs)
    return Shdrs.takeError();

  if (Shdrs.get().size()) {
    for (const Elf_Shdr &Sec : *Shdrs) {
      if (Sec.sh_type == ELF::SHT_DYNSYM) {
        // If multiple .dynsym are present, use the first one.
        // This behavior aligns with llvm::object::ELFFile::getDynSymtabSize().
        return OnFound(Sec, *Shdrs);
      }
    }
  }  // Fallback to use offsets from .dynamic as .dynsym was not found.
  Expected<const uint8_t *> PtrOrErr =
      ElfFile.toMappedAddr(DynAddr);
  if (!PtrOrErr)
    return appendToError(PtrOrErr.takeError(),
                         "when locating " + Name + " section contents");
  return *PtrOrErr;
}

template <class ELFT>
static Expected<const uint8_t *> getDynSymPtr(const ELFFile<ELFT> &ElfFile,
                                              const DynamicEntries &DynEnt) {
  return getSectionFromDynsym(ElfFile, DynEnt.DynSymAddr, "dynamic symbol table", [&ElfFile](const ELFT::Elf_Shdr &Sec, const ELFT::ShdrRange &Shdrs){ return ElfFile.base() + Sec.sh_offset(); });
}

template <class ELFT>
static Expected<StringRef> getDynStr(const ELFFile<ELFT> &ElfFile,
                                              const DynamicEntries &DynEnt) {
  Expected<StringRef> ResultOrErr = getSectionFromDynsym(ElfFile, DynEnt.DynStrAddr, "dynamic string table", [&ElfFile](const ELFT::Elf_Shdr &Sec, const ELFT::ShdrRange &Shdrs){ return ElfFile.getStringTableForSymtab(Sec, *Shdrs); });
  if (!ResultOrErr)
    return ResultOrErr;
  return StringRef(reinterpret_cast<const char *>(*ResultOrErr), DynEnt.StrSize);
}

I've also made some more idiomatic changes to the names and usage of the Expected at the end, which you should probably adopt anyway (specifically using "*OrErr" for the variable name, and using operator* to get the real value). Also, as a related aside, I'd avoid using explicitly ".dynsym" and ".dynstr" as names if you're looking up by type, as there is a chance these sections aren't named that. My example code should illustrate it better, hopefully.

552

Ditto (additional "are" needed).

haowei marked an inline comment as done.Jan 13 2022, 4:45 PM
haowei added inline comments.
llvm/lib/InterfaceStub/ELFObjHandler.cpp
549

Thanks for providing this code snippet, I modified it and finished a version that passed all the test: https://gist.github.com/zeroomega/9552161efca45ad8af17a3f312582753

Unfortunately I don't think it can achieve the goal we want. IMO the lambda solution is a bit less readable. And it is about 5lines longer if you strip the comments. I think the cause is that getDynStr and getDynSym has 2 different type of return values. The first one returns a StringRef while the other one returns a raw pointer. The API getStringTableForSymtab used in getDynStr has the same return type. If both function used a common helper function, the runtion will need to return raw pointer, and that is where all the conversion needed. If you look at L50, I have to convert StringRef back to raw pointer to accommodate that. And I have to check the Expect<T> before the conversion, otherwise I am getting Expected<T> must be checked before access or destruction. error.

haowei updated this revision to Diff 399838.Jan 13 2022, 4:48 PM
haowei marked 3 inline comments as done.Jan 13 2022, 4:51 PM
jhenderson added inline comments.Jan 14 2022, 2:34 AM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
549

My main goal is removing the near-identical code between these two functions, as duplicated code will lead to more maintenance problems down the line. I have two ideas to resolve this:

  1. We could solve this issue by changing the signature of getDynSymPtr to return an Expected<StringRef>. A cast could be added to convert within the lambda, if needed, but it doesn't make it significantly more complex. The call site would need a small tweak too, but it could happen after the Expected has been checked (a quick look shows there's already a cast, so it just would require addition of a .data() call somewhere when doing the existing cast).
  1. Another idea I had would be to somehow template the return type. I believe this isn't an option using plain old functions, but you could wrap it in a templated class, I believe, and make the function a member of that class.

I'm not sure which would look cleaner overall, but both should work, and resolve the duplicated code.

haowei updated this revision to Diff 400163.Jan 14 2022, 3:06 PM
haowei added inline comments.Jan 14 2022, 3:13 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
549

For suggestion 1. That is a hard no for me. DynSym data are binary I don't think it is wise to fill it into a StringRef which is suppose to just host references to strings. For suggestion 2. I would like to avoid use template class type for return value. It does not help the readability and makes the problem complicated. Instead, I implemented a solution in Diff 9, in which I reuse the code as best as I can. Please take a look.

I am still lean to Diff 8 since it is more straight forward and easier to understand. But if you are insist on no code should be repeated, I am fine with the solution in Diff 9.

If there is no further objection or comments, can I get a LGTM to land this change please?

If there is no further objection or comments, can I get a LGTM to land this change please?

Sorry, I've been off for two weeks, and only now getting back to these things.

llvm/lib/InterfaceStub/ELFObjHandler.cpp
517–518

This should be an assert.

524

No need for this if: the loop will do nothing if Shdrs is empty.

535–537

No need for else after return

545

I'd probably refer to them as "dynamic string table" and "dynamic symbol table", rather than by section names, since they could be called other things.

549

That is a hard no for me. DynSym data are binary I don't think it is wise to fill it into a StringRef which is suppose to just host references to strings.

Fine. Use a return type of ArrayRef<char>, and use data() and size() to construct the value then... A StringRef is a reference to an array of chars, so it's really the same... I'm pretty sure StringRef is often used in that context, even for binary data.

The current implementation in this patch isn't terrible, but it's not particularly nice either: the Type argument turns the function into a bi-modal function, so this function is being used to do two different things, which isn't really good software design. I think we can do better.

Latest alternate idea: have a class (perhaps simply called "DynSym" or similar), which during initialisation looks up the dynsym section header and stores it in a pointer, leaving it as null if not found. Then make the getDynSym and getDynStr functions member functions, which simply use the precalculated info to return the data as appropriate. This way, the shared data (the dynsym section header) is only looked up once. The two separate methods do only the behaviour that is specific to the relevant table you're looking up. Untested implementation:

class DynSym {
  using Elf_Shdr_Range = typename ELFT::ShdrRange;
  using Elf_Shdr = typename ELFT::Shdr;

public:
  static Expected<DynSym> create(const ELFFile<ELFT> &ElfFile, const DynamicEntries &DynEnt) {
      Expected<Elf_Shdr_Range> Shdrs = ElfFile.sections();
      if (!Shdrs)
        return Shdrs.takeError();
      return DynSym(ElfFile, DynEnt, *Shdrs);
  }

  Expected<const uint8_t*> getDynSym() {
    if(DynSymHdr)
      return ElfFile.base() + DynSymHdr->sh_offset;
    return getDynamicData(DynEnt.DynSymAddr, "dynamic symbol table");
  }

  // StringRef or uint8_t* return according to taste as to where the casts go.
  Expected<StringRef> getDynStr() {
    if(DynSymHdr)
      return ElfFile.getStringTableForSymtab(DynSymHdr, Shdrs);
    Expected<const uint8_t *> DataOrErr = getDynamicData(DynEnt.DynSymAddr, "dynamic string table");
    if (!DataOrErr)
      return DataOrErr.takeError();
    return StringRef(*DataOrErr, DynEnt.StrSize);
  }

private:
  DynSym(const ELFFile<ELFT> &ElfFile, const DynamicEntries &DynEnt, Elf_Shdr_Range Shdrs)
    : ElfFile(ElfFile), DynEnt(DynEnt), Shdrs(Shdrs), DynSymHdr(findDynSymHdr(Shdrs)) {}

  Elf_Shdr *findDynSymHdr() {
    // Probably use find_if?
    for (const Elf_Shdr &Sec : Shdrs)
      if (Sec.sh_type == SHT_DYNSYM)
        return &Sec;
    return nullptr;
  }

  Expected<const uint8_t *> getDynamicData(uint64_t Address, StringRef Name) {
      Expected<const uint8_t *> SecPtr = ElfFile.toMappedAddr(EntAddr);
      if (!SecPtr)
        return appendToError(SecPtr.takeError(), "when locating " + Name + " section contents");
      return *SecPtr;
  }

  const ELFFile<ELFT> &ElfFile;
  const DynamicEntries &DynEnt;
  Elf_Shdr_Range Shdrs;
  Elf_Shdr *DynSymHdr;
};
haowei updated this revision to Diff 407335.Feb 9 2022, 4:13 PM
haowei added inline comments.Feb 9 2022, 4:19 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
549

Thanks for the suggestion. I think this is better than the solutions earlier and I have rewrite this patch using the class you suggested. I think overall it looks cleaner and easier to understand.

jhenderson added inline comments.Feb 10 2022, 1:15 AM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
374

Sorry, this comment was intended purely for illustration of my example! You don't need it in the real code.

402–404

StringRef + string literals converts to llvm::Twine. Prefer that for concatenation and then use .str() to convert back to a std::string afterwards, as it's more efficient.

589–591

I think you can just do the inline edit.

593

I'm pretty sure this is more idiomatic.

629
llvm/test/tools/llvm-ifs/ifs-elf-conversion.test
4–6

As the test is the sequence of the two commands, rather than the two commands being independent, I think the blank line between them should be removed.

haowei updated this revision to Diff 407620.Feb 10 2022, 11:37 AM
haowei marked 4 inline comments as done.
haowei marked 2 inline comments as done.Feb 10 2022, 11:39 AM
jhenderson accepted this revision.Feb 11 2022, 12:00 AM

Looks reasonable to me. You probably want to give @mcgrathr a chance to confirm too.

This revision is now accepted and ready to land.Feb 11 2022, 12:00 AM
mcgrathr accepted this revision.Feb 11 2022, 1:54 PM

lgtm except for possible susceptibility to invalid DT_STRSZ values.
Perhaps there should be a test case for a corrupted PT_DYNAMIC region.

llvm/lib/InterfaceStub/ELFObjHandler.cpp
382

Is there anything that validates that this size is within the bounds of the memory returned by toMappedAddr?

haowei added inline comments.Feb 11 2022, 2:40 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
382

There is no validating in this implementation and I think that is an issue. It would need a rewrite or reimplement of toMappedAddr to validate if the size is within the bounds of the same PT_LOAD segment returned by toMappedAddr. But if we just want validate if the size is still within all PT_LOAD, we can probably do:

Expected<const uint8_t *> SecPtr = ElfFile.toMappedAddr(EntAddr);
Expected<const uint8_t *> SecEndPtr = ElfFile.toMappedAddr(EntAddr + DynEnt.StrSize);

And check if SecPtr and SecEndPtr contains any errors. Does that looks OK to you?

mcgrathr added inline comments.Feb 11 2022, 3:12 PM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
382

That would be better than nothing, but still susceptible to integer overflow. If there is nothing more robust than this that is easy to do without further refactoring, this seems like a wise incremental improvement. I think the long-term solution will indeed be to change this API to return some span-like type rather than a pointer so that accesses relative to the base address can be verified safe.

haowei updated this revision to Diff 408322.Feb 13 2022, 8:36 PM
haowei updated this revision to Diff 408323.Feb 13 2022, 9:02 PM

Adds minimal validation of STRSZ from .dynamic.

This revision was landed with ongoing or failed builds.Feb 13 2022, 9:06 PM
This revision was automatically updated to reflect the committed changes.
jhenderson added inline comments.Feb 14 2022, 12:46 AM
llvm/lib/InterfaceStub/ELFObjHandler.cpp
408–412

I think you need a test case for this new error.