This is an archive of the discontinued LLVM Phabricator instance.

[YAMLIO] Support non-null-terminated inputs
ClosedPublic

Authored by scott.linder on Jul 17 2020, 10:08 AM.

Details

Summary

In some places the parser guards against dereferencing End, while in
others it relies on the presence of a trailing '\0' to elide checks.

Add the remaining guards needed to ensure the parser never attempts to
dereference End, making it safe to not require a null-terminated input
buffer.

Update fuzzer harness so that it doesn't ensure null-terminated inputs.

Some of the regression tests were written by inspection, and some are
cases caught by the fuzzer which required additional fixes in the
parser.

Diff Detail

Event Timeline

arsenm created this revision.Jul 17 2020, 10:08 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2020, 10:08 AM

The parser used to rely on the \0 being present. Was that fixed?

The parser used to rely on the \0 being present. Was that fixed?

Do you know where this assumption was? I couldn't find it, but also couldn't find any commits that seemed to "fix" it. I haven't seen a sanitizer error yet

dexonsmith requested changes to this revision.Oct 6 2020, 3:32 PM

The parser used to rely on the \0 being present. Was that fixed?

Do you know where this assumption was? I couldn't find it, but also couldn't find any commits that seemed to "fix" it. I haven't seen a sanitizer error yet

I just did a quick scan and I can't find any such assumption either. @Bigcheese, do remember what the original intent was?

llvm/lib/Support/YAMLParser.cpp
773

Please spell this /*RequiresNullTerminator=*/false.

This revision now requires changes to proceed.Oct 6 2020, 3:32 PM
grimar added a subscriber: grimar.Oct 6 2020, 11:58 PM

This should have a unittest probably, if fixes something.

scott.linder added a subscriber: scott.linder.

Try to keep the parser from running off the end of the buffer and dereferencing
the result; add some unit tests to try to stress pieces being changed.

scott.linder commandeered this revision.Nov 11 2020, 3:20 PM
scott.linder added a reviewer: arsenm.
scott.linder marked an inline comment as done.Nov 11 2020, 3:27 PM

I tried to address the issues brought up, but I'm not particularly confident I found every case in the parser where it assumes End is OK to dereference. I think the presence of a null-terminator just made these bugs harder to detect, but as far as I can tell they are in fact bugs (i.e. End is documented as being one-past-the-end, and a lot of other places do guard against Current == End).

I didn't do a stellar job with unit testing either, I essentially just tried to produce at least enough test cases to stress the changes I made. I considered updating all of the existing tests to not use null-terminated buffers, or to check with both, but wanted to see what people thought first.

As a note, the tests will only deterministically fail with ASan enabled, but I don't know a reliable way to avoid that.

Ignore this diff, I thought I had run check-all but evidently I hadn't. I'll look at what I broke in the spec tests and post a revised patch.

As a note, the tests will only deterministically fail with ASan enabled, but I don't know a reliable way to avoid that.

That's fine, we'll always have ASan bots to catch these. Not ideal, but reasonable for this kind of situation.

The parser used to rely on the \0 being present. Was that fixed?

Do you know where this assumption was? I couldn't find it, but also couldn't find any commits that seemed to "fix" it. I haven't seen a sanitizer error yet

I just did a quick scan and I can't find any such assumption either. @Bigcheese, do remember what the original intent was?

All I recall at this point is that I required a null terminator for the same reason Clang does, to remove the constant need to check for Current != End. This probably makes a lot more sense for Clang to do as it's rare to compile source code not in a file, which may not always be true for the YAML parser. I'm fine with removing that assumption.

Restore the correct behavior of Scanner::scanAliasOrAnchor, last version of the patch mistakenly changed Token.Range when only intending to fix the error checking around an empty alias/anchor.

I think this version now removes the assumption that End points to a dereferencable '\0', but I'm not sure how to go about testing this comprehensively.

Restore the correct behavior of Scanner::scanAliasOrAnchor, last version of the patch mistakenly changed Token.Range when only intending to fix the error checking around an empty alias/anchor.

I think this version now removes the assumption that End points to a dereferencable '\0', but I'm not sure how to go about testing this comprehensively.

One idea would be to hack the parser to always duplicate the input, append a bad character (like '\1'), and parse the new buffer with the original bounds. If something reads too far it would likely get upset at the bad character. Running the test suites for LLVM and Clang with that hack might be enough coverage.

Is there a libfuzzer instance for the yaml parser? That's probably the ideal thing to do, and if it's already set up maybe it's not too hard.

dexonsmith requested changes to this revision.Nov 13 2020, 1:37 PM
dexonsmith added inline comments.
llvm/lib/Support/YAMLParser.cpp
1409–1410

I don't have an opinion on the best order of these two lines, but if you want to flip them please do so in a separate NFC commit (and revert the change from this patch to make it as clean as possible).

1411

Please run git-clang-format or similar to fix the spacing on the lines touched.

1424

This looks like a fix for a logic bug, not directly related to null-termination; can it be committed separately ahead of time? Please also add a test that covers it.

This revision now requires changes to proceed.Nov 13 2020, 1:37 PM
scott.linder retitled this revision from YAML: Don't assume an arbitrary StringRef is null terminated to [YAMLIO] Support non-null-terminated inputs.
scott.linder edited the summary of this revision. (Show Details)

Address feedback

  • Correct a spurious reordering of some lines
  • Format touched lines
  • Split off semantically distinct patch for anchor/alias diagnostic bug
scott.linder marked 3 inline comments as done.
scott.linder added inline comments.
llvm/lib/Support/YAMLParser.cpp
1409–1410

My mistake, I didn't look over the diff closely enough before posting. Fixed!

scott.linder marked an inline comment as done.
scott.linder edited the summary of this revision. (Show Details)

I ran the fuzzer and quickly found two places I had missed. I've been running the fuzzer for about a half hour now and haven't seen another crash, so I am tentatively posting the patch. I'll leave the fuzzer running for as long as I reasonably can to try to find additional cases.

Restore the correct behavior of Scanner::scanAliasOrAnchor, last version of the patch mistakenly changed Token.Range when only intending to fix the error checking around an empty alias/anchor.

I think this version now removes the assumption that End points to a dereferencable '\0', but I'm not sure how to go about testing this comprehensively.

One idea would be to hack the parser to always duplicate the input, append a bad character (like '\1'), and parse the new buffer with the original bounds. If something reads too far it would likely get upset at the bad character. Running the test suites for LLVM and Clang with that hack might be enough coverage.

Is there a libfuzzer instance for the yaml parser? That's probably the ideal thing to do, and if it's already set up maybe it's not too hard.

Also thank you for mentioning libFuzzer! It was really painless and feels like the best thing short of a proof to validate the change.

dexonsmith added inline comments.Nov 17 2020, 7:10 PM
llvm/tools/llvm-yaml-parser-fuzzer/yaml-parser-fuzzer.cpp
13–14

I wonder if it would also be useful to have logic like this:

auto testYaml = [](const uint8_t *Data, size_t Size) {
  // return true if it parses.
};

// Test that there's no crash when parsing the raw data.
testYaml(Data, Size);

// Test with 0s filtered out.
std::string Input(reinterpret_cast<const char *>(Data), Size);
Input.erase(std::remove(Input.begin(), Input.end(), 0), Input.end());
Size = Input.size();
bool TerminatedBy0 = testYaml(Input.data(), Size);

// Test that an invalid character after the input has no effect.
Input.push_back(1);
if (testYaml(Input.data(), Size) != TerminatedBy0)
  __builtin_trap(...);
return 0;

I added something along the lines of what you proposed, let me know what you
think. I also switched to a vector<uint8_t> to get more explicit control over
the terminator, and to yaml::Stream to be able to directly call validate()
such that every document is checked (in case the fuzzer produces inputs with
multiple documents).

I had also considered trying to duplicate the existing parser to do comparitive
testing, but I wasn't sure the best way to go about it. I could just build the
compiler twice, and then have an instance of the fuzzer feed into both and
compare their results. Let me know if you would find this helpful, and I can
do the test and just not commit the result.

scott.linder marked an inline comment as done.Nov 18 2020, 12:24 PM

I also added calls to vector::shrink_to_fit to maybe make it slightly more likely for a sanitizer without c++ STL annotations to catch out-of-bounds accesses, but if this is just noise I'm happy to remove them.

Remove one redundant call to vector::shrink_to_fit

dexonsmith accepted this revision.Nov 18 2020, 1:40 PM

This LGTM now, thanks for integrating this into the fuzzer! I have a couple of other suggestions, but it's up to you whether to apply them.

llvm/tools/llvm-yaml-parser-fuzzer/yaml-parser-fuzzer.cpp
11

It may be nice to have a using namespace llvm; here and remove the llvm::s below. Up to you.

12

This could probably be isValidYaml to silence the clang-tidy warning. LLVM_FuzzerTestOneInput has to be how it us, but we could still modify this.

This revision is now accepted and ready to land.Nov 18 2020, 1:40 PM
This revision was landed with ongoing or failed builds.Nov 18 2020, 3:06 PM
This revision was automatically updated to reflect the committed changes.