This is an archive of the discontinued LLVM Phabricator instance.

[Support] Improve readNativeFile(Slice) interface
ClosedPublic

Authored by labath on Aug 20 2019, 5:12 AM.

Details

Summary

There was a subtle, but pretty important difference between the Slice
and regular versions of this function. The Slice function was
zero-initializing the rest of the buffer when the read syscall returned
less bytes than expected, while the regular function did not.

This patch removes the inconsistency by making both functions *not*
zero-initialize the buffer. The zeroing code is moved to the
MemoryBuffer class, which is currently the only user of this code. This
makes the API more consistent, and the code shorter.

While in there, I also refactor the functions to return the number of
bytes through the regular return value (via Expected<size_t>) instead of
a separate by-ref argument.

Diff Detail

Repository
rL LLVM

Event Timeline

labath created this revision.Aug 20 2019, 5:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2019, 5:12 AM
Herald added a subscriber: kristina. · View Herald Transcript
aganea added inline comments.Aug 20 2019, 7:38 AM
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

Shouldn't we use ToRead here instead of Buf->getBuffer()? Could you please test two sequential reads in the unittests?

lib/Support/Unix/Path.inc
1005 ↗(On Diff #216104)

The cast is not needed here.

labath marked an inline comment as done.Aug 20 2019, 8:00 AM
labath added inline comments.
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

Woops, indeed it should. Thanks for catching that!

As for the test, I am not sure how I could force the read syscall to return partial data here. For the non-slice version, I suppose I could make this read from a pipe, and then make sure I write to the pipe very slowly. However, the slice version requires a seekable fd, and I don't know what to do there... Do you have any suggestion?

aganea added inline comments.Aug 20 2019, 8:06 AM
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

You can probably call MemoryBuffer::getOpenFile with a smaller FileSize & MapSize and IsVolatile = true? And then a subsequent call by setting Offset > 0? Would that work?

labath marked an inline comment as done.Aug 20 2019, 8:23 AM
labath added inline comments.
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

I don't fully understand what you mean, but I don't see how could that work. Are you expecting that the second time around the OS will return the partial read first, because that data will already be cached and ready? If so, then I would be very surprised if that works, particularly for a file that has just been created because it will likely be present in the disk cache in its entirety.

aganea added inline comments.Aug 20 2019, 8:47 AM
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

You're right. You would need to test it from the shell with something like: cat foo | pv --quiet --line-mode --rate-limit 1 | clang -emit-llvm -o -. I don't know if pv is available everywhere, or if there's a better solution to slow down cat? What do you think?

labath marked an inline comment as done.Aug 20 2019, 9:19 AM
labath added inline comments.
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

Ok, I think I am starting to understand what you mean. There's no problem with slowing down the input for a pipe. I can do that from a unit test relatively easily. The problem is that that the the pipe is not seekable (at least on unix OSs, I don't know what would windows do here). The reason that clang - works is because it invokes a different MemoryBuffer function, which is implemented in terms of readNativeFile and so it works fine with a non-seekable fd. However, calling readNativeFileSlice with a non-seekable fd would result in an error. And unfortunately, I don't know if there's a way to create a seekable fd, which will not satisfy all read requests immediately (short of some FUSE mounts or similarly heroic efforts).

rnk added a comment.Aug 20 2019, 11:11 AM

Seems reasonable aside from the bug. :)

lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

Right, testing a short read from a seekable file is... very hard. IIRC Linux goes out of its way to avoid surfacing short reads if it can. One way to do it would be to make a 6GB file on Windows, since it has the 2GB ReadFile limitation. Of course, I was happy to commit my code without such a test, so I can only apply the same standard to you. :)

aganea added inline comments.Aug 20 2019, 11:22 AM
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

What about this? This does exercise this loop, including the memset.
And it looks like the 2GB is no more (at least on my Windows 10 1803), calling ::ReadFile with (2^32)-1 works as expected. We should probably update the comment in lib/Support/Windows/Path.inc, L1223 :)

TEST_F(FileSystemTest, bigFile) {
  int FD;
  SmallString<64> TempPath;
  ASSERT_NO_ERROR(fs::createTemporaryFile("prefix", "temp", FD, TempPath));

  // Would cause two file reads on Windows, the limit for each read being
  // (2^32)-1
  size_t Size = (1ULL << 32);
  auto E = fs::resize_file(FD, Size);
  if (E) {
    ASSERT_NO_ERROR(fs::remove(Twine(TempPath)));
    return; // skip test is there's enough temp/ space
  }

  auto Buf = MemoryBuffer::getFile(TempPath, Size+1, true, true);
  EXPECT_TRUE((bool)Buf);
  EXPECT_EQ(Buf.get()->getBufferSize(), Size+1);

  ASSERT_NO_ERROR(fs::remove(Twine(TempPath)));
}
labath marked an inline comment as done.Aug 21 2019, 1:59 AM
labath added inline comments.
lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

That would work, if we are OK with creating a 4GB file for the purpose of testing this *and* requiring 4GB of ram to be able to read it.

In fact, linux seems to have the 2GB limitation too (2GB-4KB, to be precise), so this would work there too. Also, most filesystems on linux support sparse files, so we would probably be able to avoid actually wasting that much disk space, if we're really careful. However, the need for a 4GB ram buffer worries me, particularly as detecting OOM conditions reliably is quite hard.

The standard approach to testing code which relies on external stuff which cannot be (easily) controlled in a test is to mock the external component. However, this requires changes to code being tested, such as parameterizing the MemoryBuffer class with a filesystem implementation, and I am not aware of this approach being used anywhere in llvm...

labath updated this revision to Diff 216359.Aug 21 2019, 2:56 AM
  • fix the partial read bug
  • add a partial read test for the streaming case (not tested on windows yet)
aganea accepted this revision.Aug 21 2019, 11:39 AM

LGTM, just a missing include (see below).

  • add a partial read test for the streaming case (not tested on windows yet)

Works on Windows.

lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

I am equally happy to commit that above test later on. I could possibly scope it to Windows and ensure, for example, available memory > 16 GB and available free heap > 4 GB.

unittests/Support/MemoryBufferTest.cpp
23 ↗(On Diff #216359)

Missing:

#ifdef _WIN32
#include <windows.h> 
#endif
This revision is now accepted and ready to land.Aug 21 2019, 11:39 AM
labath marked an inline comment as done.Aug 22 2019, 1:09 AM
  • add a partial read test for the streaming case (not tested on windows yet)

Works on Windows.

Thanks for checking it out.

lib/Support/MemoryBuffer.cpp
467 ↗(On Diff #216104)

Sounds good. I think doing that separately is a good idea. I might be overcautious, but I am expecting it will take several tries before all the bots are happy with a test like that.

This revision was automatically updated to reflect the committed changes.