This is an archive of the discontinued LLVM Phabricator instance.

Fix identification of COFF executable files
ClosedPublic

Authored by zturner on Mar 7 2018, 1:21 PM.

Details

Summary

identify_magic on COFF executables has been broken for apparently quite some time. The reason is because we only read the first 32 bytes of the file, but the detection algorithm for COFF requires at *minimum* 64 bytes, but in practice you need to read the entire file because the algorithm reads a 4-byte value at offset 60 and then treats that as an offset, from which it reads additional data.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner created this revision.Mar 7 2018, 1:21 PM
smeenai added a subscriber: smeenai.Mar 7 2018, 1:23 PM

Does anyone know why we bother with the "PE\0\0" validation instead of just relying on the "MZ" magic check? File magic checks are supposed to be heuristic anyway.

rnk added a subscriber: echristo.Mar 7 2018, 1:39 PM

I did some archaelogy and found that this is from rL128799, committed by @echristo on behalf of Graydon Hoare. I think we should just go ahead and remove the "PE" check.

In D44225#1030532, @rnk wrote:

I did some archaelogy and found that this is from rL128799, committed by @echristo on behalf of Graydon Hoare. I think we should just go ahead and remove the "PE" check.

It's true that file magic checks are heuristic, but having it rely on only 2 bytes will have quite a lot of false positives on binary files.

How expensive is it to MemoryBuffer::getFilePath() in all cases? Unless we have reason to believe it is expensive, I would say let's just do that, and then we can remove the code that reads the 32 bytes and detect PE/COFF executables properly.

I took a look at http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/file/dist/magic/magdir/msdos?rev=1.1.1.10&content-type=text/x-cvsweb-markup&only_with_tag=MAIN to see what file does, and that also looks for the PE header to determine if this is a PE/COFF executable.

It's probably not super expensive, but we want library code to behave as nicely as possible in 32-bit, and mapping a potentially huge file could kill a 32-bit process.

zturner updated this revision to Diff 137482.Mar 7 2018, 2:24 PM

Always read the whole file.

rnk added a comment.Mar 7 2018, 3:38 PM

Most callers already map the whole file and then call the other identify_magic overload. Only llvm-pdbutil uses this overload. Maybe we should remove this overload and push the complexity of mapping the file up to there?

@rnk, I like the idea of pushing the mapping behavior to the single use in llvm-pdbutil and removing this overload.

LGTM.

A followup deleting this completely like Reid Kleckner suggests also
LGTM.

Cheers,
Rafael

Zachary Turner via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

zturner updated this revision to Diff 137482.
zturner added a comment.

Always read the whole file.

https://reviews.llvm.org/D44225

Files:

llvm/lib/BinaryFormat/Magic.cpp

Index: llvm/lib/BinaryFormat/Magic.cpp

  • llvm/lib/BinaryFormat/Magic.cpp

+++ llvm/lib/BinaryFormat/Magic.cpp
@@ -14,6 +14,7 @@
#include "llvm/BinaryFormat/MachO.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"

#if !defined(_MSC_VER) && !defined(MINGW32)
#include <unistd.h>
@@ -205,15 +206,12 @@
}

std::error_code llvm::identify_magic(const Twine &Path, file_magic &Result) {

  • int FD;
  • if (std::error_code EC = openFileForRead(Path, FD))
  • return EC;

+ auto FileOrError = MemoryBuffer::getFile(Path);
+ if (!FileOrError)
+ return FileOrError.getError();

  • char Buffer[32];
  • int Length = read(FD, Buffer, sizeof(Buffer));
  • if (close(FD) != 0 || Length < 0)
  • return std::error_code(errno, std::generic_category());

+ std::unique_ptr<MemoryBuffer> FileBuffer = std::move(*FileOrError);
+ Result = identify_magic(FileBuffer->getBuffer());

  • Result = identify_magic(StringRef(Buffer, Length)); return std::error_code(); }

Index: llvm/lib/BinaryFormat/Magic.cpp

  • llvm/lib/BinaryFormat/Magic.cpp

+++ llvm/lib/BinaryFormat/Magic.cpp
@@ -14,6 +14,7 @@
#include "llvm/BinaryFormat/MachO.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"

#if !defined(_MSC_VER) && !defined(MINGW32)
#include <unistd.h>
@@ -205,15 +206,12 @@
}

std::error_code llvm::identify_magic(const Twine &Path, file_magic &Result) {

  • int FD;
  • if (std::error_code EC = openFileForRead(Path, FD))
  • return EC;

+ auto FileOrError = MemoryBuffer::getFile(Path);
+ if (!FileOrError)
+ return FileOrError.getError();

  • char Buffer[32];
  • int Length = read(FD, Buffer, sizeof(Buffer));
  • if (close(FD) != 0 || Length < 0)
  • return std::error_code(errno, std::generic_category());

+ std::unique_ptr<MemoryBuffer> FileBuffer = std::move(*FileOrError);
+ Result = identify_magic(FileBuffer->getBuffer());

  • Result = identify_magic(StringRef(Buffer, Length)); return std::error_code(); }

llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits

Only if we have a caller that is trying to identify large files but only
ends up using small ones.

Cheers,
Rafael

Zachary Turner via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

zturner added a comment.

It's probably not super expensive, but we want library code to behave as nicely as possible in 32-bit, and mapping a potentially huge file could kill a 32-bit process.

https://reviews.llvm.org/D44225


llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits

I agree. Lets replace the read with MemoryBuffer::getFile. Most callers
will probably use the file anyway and should be using the
identify_magic(StringRef magic) variant.

Cheers,
Rafael

Bob Haarman via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:

inglorion added a comment.

How expensive is it to MemoryBuffer::getFilePath() in all cases? Unless we have reason to believe it is expensive, I would say let's just do that, and then we can remove the code that reads the 32 bytes and detect PE/COFF executables properly.

I took a look at http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/file/dist/magic/magdir/msdos?rev=1.1.1.10&content-type=text/x-cvsweb-markup&only_with_tag=MAIN to see what file does, and that also looks for the PE header to determine if this is a PE/COFF executable.

https://reviews.llvm.org/D44225


llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits

This revision was not accepted when it landed; it landed in state Needs Review.Mar 8 2018, 11:49 AM
This revision was automatically updated to reflect the committed changes.