identify_magic on COFF executables has been broken for apparently quite some time. The reason is because we only read the first 32 bytes of the file, but the detection algorithm for COFF requires at *minimum* 64 bytes, but in practice you need to read the entire file because the algorithm reads a 4-byte value at offset 60 and then treats that as an offset, from which it reads additional data.
Details
Diff Detail
Event Timeline
Does anyone know why we bother with the "PE\0\0" validation instead of just relying on the "MZ" magic check? File magic checks are supposed to be heuristic anyway.
It's true that file magic checks are heuristic, but having it rely on only 2 bytes will have quite a lot of false positives on binary files.
How expensive is it to MemoryBuffer::getFilePath() in all cases? Unless we have reason to believe it is expensive, I would say let's just do that, and then we can remove the code that reads the 32 bytes and detect PE/COFF executables properly.
I took a look at http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/file/dist/magic/magdir/msdos?rev=1.1.1.10&content-type=text/x-cvsweb-markup&only_with_tag=MAIN to see what file does, and that also looks for the PE header to determine if this is a PE/COFF executable.
It's probably not super expensive, but we want library code to behave as nicely as possible in 32-bit, and mapping a potentially huge file could kill a 32-bit process.
Most callers already map the whole file and then call the other identify_magic overload. Only llvm-pdbutil uses this overload. Maybe we should remove this overload and push the complexity of mapping the file up to there?
@rnk, I like the idea of pushing the mapping behavior to the single use in llvm-pdbutil and removing this overload.
LGTM.
A followup deleting this completely like Reid Kleckner suggests also
LGTM.
Cheers,
Rafael
Zachary Turner via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:
zturner updated this revision to Diff 137482.
zturner added a comment.Always read the whole file.
https://reviews.llvm.org/D44225
Files:
llvm/lib/BinaryFormat/Magic.cppIndex: llvm/lib/BinaryFormat/Magic.cpp
- llvm/lib/BinaryFormat/Magic.cpp
+++ llvm/lib/BinaryFormat/Magic.cpp
@@ -14,6 +14,7 @@
#include "llvm/BinaryFormat/MachO.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"#if !defined(_MSC_VER) && !defined(MINGW32)
#include <unistd.h>
@@ -205,15 +206,12 @@
}std::error_code llvm::identify_magic(const Twine &Path, file_magic &Result) {
- int FD;
- if (std::error_code EC = openFileForRead(Path, FD))
- return EC;
+ auto FileOrError = MemoryBuffer::getFile(Path);
+ if (!FileOrError)
+ return FileOrError.getError();
- char Buffer[32];
- int Length = read(FD, Buffer, sizeof(Buffer));
- if (close(FD) != 0 || Length < 0)
- return std::error_code(errno, std::generic_category());
+ std::unique_ptr<MemoryBuffer> FileBuffer = std::move(*FileOrError);
+ Result = identify_magic(FileBuffer->getBuffer());
- Result = identify_magic(StringRef(Buffer, Length)); return std::error_code(); }
Index: llvm/lib/BinaryFormat/Magic.cpp
- llvm/lib/BinaryFormat/Magic.cpp
+++ llvm/lib/BinaryFormat/Magic.cpp
@@ -14,6 +14,7 @@
#include "llvm/BinaryFormat/MachO.h"
#include "llvm/Support/Endian.h"
#include "llvm/Support/FileSystem.h"
+#include "llvm/Support/MemoryBuffer.h"#if !defined(_MSC_VER) && !defined(MINGW32)
#include <unistd.h>
@@ -205,15 +206,12 @@
}std::error_code llvm::identify_magic(const Twine &Path, file_magic &Result) {
- int FD;
- if (std::error_code EC = openFileForRead(Path, FD))
- return EC;
+ auto FileOrError = MemoryBuffer::getFile(Path);
+ if (!FileOrError)
+ return FileOrError.getError();
- char Buffer[32];
- int Length = read(FD, Buffer, sizeof(Buffer));
- if (close(FD) != 0 || Length < 0)
- return std::error_code(errno, std::generic_category());
+ std::unique_ptr<MemoryBuffer> FileBuffer = std::move(*FileOrError);
+ Result = identify_magic(FileBuffer->getBuffer());
- Result = identify_magic(StringRef(Buffer, Length)); return std::error_code(); }
llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
Only if we have a caller that is trying to identify large files but only
ends up using small ones.
Cheers,
Rafael
Zachary Turner via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:
zturner added a comment.
It's probably not super expensive, but we want library code to behave as nicely as possible in 32-bit, and mapping a potentially huge file could kill a 32-bit process.
https://reviews.llvm.org/D44225
llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
I agree. Lets replace the read with MemoryBuffer::getFile. Most callers
will probably use the file anyway and should be using the
identify_magic(StringRef magic) variant.
Cheers,
Rafael
Bob Haarman via Phabricator via llvm-commits
<llvm-commits@lists.llvm.org> writes:
inglorion added a comment.
How expensive is it to MemoryBuffer::getFilePath() in all cases? Unless we have reason to believe it is expensive, I would say let's just do that, and then we can remove the code that reads the 32 bytes and detect PE/COFF executables properly.
I took a look at http://cvsweb.netbsd.org/bsdweb.cgi/src/external/bsd/file/dist/magic/magdir/msdos?rev=1.1.1.10&content-type=text/x-cvsweb-markup&only_with_tag=MAIN to see what file does, and that also looks for the PE header to determine if this is a PE/COFF executable.
https://reviews.llvm.org/D44225
llvm-commits mailing list
llvm-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits