This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add sanitizer instrumentation to the SIMD strlen implementation.
Needs ReviewPublic

Authored by sivachandra on Apr 21 2020, 10:08 PM.

Details

Reviewers
hctim
Summary

Depends on D78611.

Diff Detail

Event Timeline

sivachandra created this revision.Apr 21 2020, 10:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2020, 10:08 PM
  • clang-format

I do not think this patch is ready. But, I am sending this out for review by @hctim to get his feedback about the hand-written sanitizer instrumentation in strlen.cpp and the corresponding testing in strlen_test.cpp.

I'm not sure if this discussion belongs here or D77949, but as this complicates that patch even more I will do so here.

I really wonder if the added complexity is worth the speedup we are hoping to achieve. Firstly even if this implements ASan which is by far the most common and ubiquitously useful, it makes sense to implement it here, we also have fuzz tests set up which rely on -fsanitize=fuzzer. That won't be implemented here. The whole appeal of sanitizers is that nothing needs to change in our code, but it's not true here no matter what we do if we rely on UB.

This patch is well done and it's very clever and easy to understand way to implement by hand what -fsanitize=address is doing. But it is undeniably making this much less maintainable. Is the cost we are paying in maintainability worth the speedup we are hoping to get using this implementation? If this patch is only for strlen and not an example for how we will handle all such cases which rely on UB for clever tricks, then I would say it is not worth it.

FWIW all of that would be moot if this was performance critical, but I'm not convinced that strlen is a particularly hot function, especially given its use outside of C is very limited. Even in C++ it would be used much less frequently and for programs written in other languages, that would be even less than that is my guess. Even if it were a hot function, how large does the string need to be before the implementation in D77949 is better than what we currently have?

I'm not sure if this discussion belongs here or D77949, but as this complicates that patch even more I will do so here.

I really wonder if the added complexity is worth the speedup we are hoping to achieve. Firstly even if this implements ASan which is by far the most common and ubiquitously useful, it makes sense to implement it here, we also have fuzz tests set up which rely on -fsanitize=fuzzer. That won't be implemented here. The whole appeal of sanitizers is that nothing needs to change in our code, but it's not true here no matter what we do if we rely on UB.

This patch is well done and it's very clever and easy to understand way to implement by hand what -fsanitize=address is doing. But it is undeniably making this much less maintainable. Is the cost we are paying in maintainability worth the speedup we are hoping to get using this implementation? If this patch is only for strlen and not an example for how we will handle all such cases which rely on UB for clever tricks, then I would say it is not worth it.

Yes, the idea is to use strlen as the first example. The approach we want to take with LLVM libc is to hand craft the instrumentation vs using interceptors as is done for other libcs. So, we want to start learning about such hand-crafted instrumentation using this "simple" example.

About other sanitizers, that is a very valid question. I want @hctim or other sanitizer people to tell us more. FWIW, I heard of informal interest from glibc developers also about using sanitizer instrumentation. Currently, each sanitizer has its own interface API. I would really prefer that all of them are unified in some manner to make them simpler to use. So, my personal view is that we might be heading towards a more generalized/standardized sanitizer API.

FWIW all of that would be moot if this was performance critical, but I'm not convinced that strlen is a particularly hot function, especially given its use outside of C is very limited. Even in C++ it would be used much less frequently and for programs written in other languages, that would be even less than that is my guess. Even if it were a hot function, how large does the string need to be before the implementation in D77949 is better than what we currently have?

I do not have any benchmarks to share or point to wrt strlen particularly. I have done some asking around, and the feedback I received is that we should match at least glibc's performance.

Also adding @phosek who can share his view/opinion about this.

I personally think this is too clever. The whole templating thing seems like a good overall QOL improvement - but the whole unsanitized safe_check_word that we have to manually sanitize seems challenging to maintain. You would need to do sanitizer instrumentation manually for ASan, MSan, UBSan, HWASan, MTE, DFSan, and any other fun toys we come up with. Each of these has a different interface, and different quirks. Why not just let the compiler do these quirks?

Instead, can't we just use the slow path always under sanitizers? i.e.:

#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  static constexpr bool word_optimized = false;
#else // ASan
  static constexpr bool word_optimized = Mask<sizeof(uintptr_t)>::opt;
#endif

Then, the whole dancing-around-managing sanitizers is solved as the compiler just does normal instrumentation of the slow path.

libc/test/src/string/strlen_test.cpp
70

Wouldn't this be simpler just as const char *hello = "hello"? We definitely shouldn't be manually doing asan-ish things in the test.

sivachandra marked an inline comment as done.Apr 22 2020, 10:41 AM

I personally think this is too clever. The whole templating thing seems like a good overall QOL improvement - but the whole unsanitized safe_check_word that we have to manually sanitize seems challenging to maintain. You would need to do sanitizer instrumentation manually for ASan, MSan, UBSan, HWASan, MTE, DFSan, and any other fun toys we come up with. Each of these has a different interface, and different quirks. Why not just let the compiler do these quirks?

Sorry, the templating thing is probably a distraction at this point.

With respect to the sanitizer parts: In this patch, strlen is an example. With the present sanitizer API, wouldn't hand-crafting instrumentation mean that we have to manually deal with multiple sanitizers in general anyway?

Instead, can't we just use the slow path always under sanitizers? i.e.:

#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  static constexpr bool word_optimized = false;
#else // ASan
  static constexpr bool word_optimized = Mask<sizeof(uintptr_t)>::opt;
#endif

Then, the whole dancing-around-managing sanitizers is solved as the compiler just does normal instrumentation of the slow path.

My view is that this just defeats the purpose of sanitizing. Also, we will be setting an example which says, "it is OK to keep fast paths unsanitized."

There is an alternate approach we can take here: We can keep the safe_word_check function unsanitized without any instrumentation. Since the main function is sanitized, it will catch bad memory reads as it does the reading again anyway. This is probably a nice compromise, but I would like to hear about the spirit of such an approach: we are not sanitizing one function at all.

libc/test/src/string/strlen_test.cpp
70

Other tests do the equivalent. The literal string ends up in the data section which is properly aligned. So, will not trigger an ASAN error as the word read is still within the data section?

tschuett added a comment.EditedApr 22 2020, 10:42 AM

I am starting to believe that it is better to stay with the slow and safe strlen implementation. Avoid the sanitizer dancing. Spend more energy on adding features. So that people start using this libc. Then you can use a profiler to find the hot functions.

I am starting to believe that it is better to stay with the slow and safe strlen implementation. Avoid the sanitizer dancing. Spend more energy on adding features. So that people starting using this libc. Then you can use a profiler to find the hot functions.

Making LLVM libc sanitizer friendly is a major goal. So, sorting out sanitizer issues here is not a waste but actually contributing to a feature of LLVM libc.

Sorry, your patch had to be hijacked. But, I felt strlen was simple enough to understand and set an example with.

Sorry, my mistake. With "avoid the sanitizer dancing" I meant that the compiler should do the instrumentation instead of doing it manually, which is error-prone.

I personally think this is too clever. The whole templating thing seems like a good overall QOL improvement - but the whole unsanitized safe_check_word that we have to manually sanitize seems challenging to maintain. You would need to do sanitizer instrumentation manually for ASan, MSan, UBSan, HWASan, MTE, DFSan, and any other fun toys we come up with. Each of these has a different interface, and different quirks. Why not just let the compiler do these quirks?

Sorry, the templating thing is probably a distraction at this point.

With respect to the sanitizer parts: In this patch, strlen is an example. With the present sanitizer API, wouldn't hand-crafting instrumentation mean that we have to manually deal with multiple sanitizers in general anyway?

Instead, can't we just use the slow path always under sanitizers? i.e.:

#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  static constexpr bool word_optimized = false;
#else // ASan
  static constexpr bool word_optimized = Mask<sizeof(uintptr_t)>::opt;
#endif

Then, the whole dancing-around-managing sanitizers is solved as the compiler just does normal instrumentation of the slow path.

My view is that this just defeats the purpose of sanitizing. Also, we will be setting an example which says, "it is OK to keep fast paths unsanitized."

I agree, sanitizers are useful for detecting problems in programs using libc and detecting problems with the libc itself. It's really the "fast path" where we want the second kind, because it's more complicated and harder to see if it's safe. @sivachandra is right, and thats not an example I would want to set here.

There is an alternate approach we can take here: We can keep the safe_word_check function unsanitized without any instrumentation. Since the main function is sanitized, it will catch bad memory reads as it does the reading again anyway. This is probably a nice compromise, but I would like to hear about the spirit of such an approach: we are not sanitizing one function at all.

This makes a lot of sense, I am in strong favor of this approach. We keep both the performance of D77949 and we achieve the goal of easy sanitization. Theres also a lot of merit in pulling out the unsafe bits of a function and explicitly marking them as such, it documents we understand it is UB and is a good starting point if there are ever any bugs.

There is an alternate approach we can take here: We can keep the safe_word_check function unsanitized without any instrumentation. Since the main function is sanitized, it will catch bad memory reads as it does the reading again anyway. This is probably a nice compromise, but I would like to hear about the spirit of such an approach: we are not sanitizing one function at all.

This makes a lot of sense, I am in strong favor of this approach. We keep both the performance of D77949 and we achieve the goal of easy sanitization. Theres also a lot of merit in pulling out the unsafe bits of a function and explicitly marking them as such, it documents we understand it is UB and is a good starting point if there are ever any bugs.

For completeness, I would like to point out that it will not catch bad strings. But, I am not sure if it is the job of strlen to catch bad strings. If we ensure that the implementation is safe for good strings, we are probably good.

Do what ever you want, but I would go with the slow and safe strlen implementation until somebody proves that strlen is a performance issue.

My view is that this just defeats the purpose of sanitizing. Also, we will be setting an example which says, "it is OK to keep fast paths unsanitized."

I agree, sanitizers are useful for detecting problems in programs using libc and detecting problems with the libc itself. It's really the "fast path" where we want the second kind, because it's more complicated and harder to see if it's safe. @sivachandra is right, and thats not an example I would want to set here.

From a libc-user's perspective - I want to know if my call to libc was buggy. How that function is implemented is not really important (i.e. whether I get one implementation or another) - and this is how the interceptors work currently with glibc. I don't expect that a sanitizer-friendly implementation would be more than 2x slower than a "fast variant".

From a libc-developer's perspective - I completely understand that having a sanitizer-friendly implementation and a sanitizer-unfriendly implementation is undesirable. The tradeoff here is an enormous maintenance burden. By emulating ASan's behaviour here - it's creating the expectation to emulate all of MSan, TSan, HWASan, UBSan, MTE and DFSan's behaviours as well. This turns strlen() into hundreds of lines of sanitizer-specific behaviour and tricks.

There is an alternate approach we can take here: We can keep the safe_word_check function unsanitized without any instrumentation. Since the main function is sanitized, it will catch bad memory reads as it does the reading again anyway. This is probably a nice compromise, but I would like to hear about the spirit of such an approach: we are not sanitizing one function at all.

This makes a lot of sense, I am in strong favor of this approach. We keep both the performance of D77949 and we achieve the goal of easy sanitization. Theres also a lot of merit in pulling out the unsafe bits of a function and explicitly marking them as such, it documents we understand it is UB and is a good starting point if there are ever any bugs.

This approach works well here. Bearing in mind that it doesn't work trivially for anything that doesn't "double back" on the sanitizer-friendly version. Think SIMD strcpy() + MSan.

libc/test/src/string/strlen_test.cpp
70

ASan doesn't care about alignment. It stores a description of the real size of the global - which is caught by partially addressable dwords.

Calling __asan_region_is_poisoned on every word will have very poor performance under ASan. Not sanitizing that entire function would allow strlen to "hop over" inaccessible regions as long as the (alleged) end of the string is inside a valid allocation.

IMHO, running slow path under asan is not that bad, on the balance. The fast path is simple enough to verify by code inspection. Of course, we should only do this for truly simple and performance-critical functions.

Another option is to run the fast path unsanitized, and then check that the reported string bytes are actually accessible according to ASan. That's what the interceptors do. It is possible that the fast path escapes, hits an unmapped page and dies, but we have not seen in practice at all.

From a libc-developer's perspective - I completely understand that having a sanitizer-friendly implementation and a sanitizer-unfriendly implementation is undesirable. The tradeoff here is an enormous maintenance burden. By emulating ASan's behaviour here - it's creating the expectation to emulate all of MSan, TSan, HWASan, UBSan, MTE and DFSan's behaviours as well. This turns strlen() into hundreds of lines of sanitizer-specific behaviour and tricks.

Totally agree that emulating every sanitizer's behavior is not the approach we should be taking. This is probably dumb, but like I have mentioned previously, is it possible to unify/simplify the API of the various sanitizers?

Since everyone from sanitizers is subscribed on this change, let me ask few general questions:

  1. Should LLVM libc have code which relies on, for example, the undefined behavior like in the implementation of strlen in the change?
  2. Should LLVM libc have hand-crafted sanitizer instrumentation at all? If yes, what should be the criteria to determine whether it is required?
  1. Should LLVM libc have code which relies on, for example, the undefined behavior like in the implementation of strlen in the change?

IMHO, yes, if you can show meaningful performance improvement on non-synthetic benchmarks.

  1. Should LLVM libc have hand-crafted sanitizer instrumentation at all? If yes, what should be the criteria to determine whether it is required?

If the answer to the previous question is yes, then it is pretty much required.
Even w/o UB, sanitizers may need source-level help. For example, TSan may need happens-before annotations around atomic-based (or external to the process) synchronization in some cases for correctness.

About unifiying the sanitizer API. Do you have any concrete suggestions?
Sanitizers are generally interested in different code properties. For example this strlen algorithm would be an issue for addressability checkers only - asan and hwasan.
TSan is entirely different and would be interested in synchronization events but not much else.

As a LibreOffice committer, strlen and friends are definitely hot :-)

See also https://reviews.llvm.org/D129808 (cc @michaelrj )

strlen was the in the top 4 hottest libc routines in my statically linked image of clang when building the Linux kernel.

Performance numbers from @michaelrj here: https://quick-bench.com/q/Z7wq6I3K6xUIeAIogIpHpPgBwSw

Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2022, 9:14 PM
Herald added a subscriber: ecnelises. · View Herald Transcript