Page MenuHomePhabricator

[Draft][MSAN] Optimize away poisoning allocas that are always written before load
Needs ReviewPublic

Authored by guiand on Jul 10 2020, 4:10 PM.

Details

Summary

If we know that every path from an alloca leads to a store, we can optimize away poisoning its shadow (it'll be overwritten anyway).

I'm wondering if there's a better approach for finding all these def-uses. I investigated the DominatorTree, but I'm not sure how I would use that to check *all* the uses without blowing up the runtime for instrumenting an alloca to O(N^2) (N=# uses) or so.

Diff Detail

Event Timeline

guiand created this revision.Jul 10 2020, 4:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2020, 4:10 PM

Perhaps MemSSA-based DCE can be taught about it?

guiand marked an inline comment as done.Jul 10 2020, 4:14 PM
guiand added inline comments.
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3848

TODO: get rid of Const in this function name

vitalybuka added inline comments.Jul 10 2020, 4:18 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3871

Store may write only part of alloca.
How useful (binary size) the patch as is?

Allocas are often used through bitcast or GEP, we should handle them as well.

I figured if it's using GEP then it's likely not going to be storing to the entire shadow. Bitcast is overwhelmingly common though (especially bitcast->lifetime.start, which I realize I need to handle).

I figured if it's using GEP then it's likely not going to be storing to the entire shadow. Bitcast is overwhelmingly common though (especially bitcast->lifetime.start, which I realize I need to handle).

Right. From what I've seen, small aggregates on stack are quite common as well, and they are always initialized piecemeal. This can be seen as an improvement of this change, but it would be a pretty bit change to the algorithm, so consider including that from the start.

Perhaps MemSSA-based DCE can be taught about it?

That's an option, but it could be harder to pull off when poisoning is outlined. The code will look something like this:

%p = alloca i32
call __msan_poison_and_set_origin(%p, 4)
...
%s_p = inttoptr(xor(ptrtoint(%p), 0x50..00)))  ; shadow address for %p
store i32 zeroinitializer, %s_p
store i32 <value>, %p

To eliminate the dead call to __msan_poison, DCE would need to know the shadow mapping.

guiand updated this revision to Diff 277491.Jul 13 2020, 11:02 AM

I've expanded this to be able to find chained uses like through a bitcast.

Mostly a proof-of-concept; this still needs a check to make sure the final store is over the whole size of the initial type. And, of course, the main issue is that this really slows down compilation. So I think using MemSSA might be necessary for performance reasons, unless there's some algorithm tweak I can make here that would reduce the complexity.

Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2020, 11:02 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
fhahn added a comment.Jul 14 2020, 6:43 AM

If we know that every path from an alloca leads to a store, we can optimize away poisoning its shadow (it'll be overwritten anyway).

! In D83595#2145347, @lebedev.ri wrote:

Perhaps MemSSA-based DCE can be taught about it?

MemorySSA-based DSE has similar logic (check that a store is either overwritten or never read on all paths to a function exit), but it uses a MemorySSA traversal to do so. I don't think alloca are modeled directly in MemorySSA though, so the approach cannot be directly applied I think. It might be possible to add a pseudo-memorydef writing to the alloca directly after its definition. Then the MemorySSA based logic may apply.

guiand updated this revision to Diff 277900.Jul 14 2020, 10:54 AM

Added ability to detect piecemeal initialization.

This actually has very significant effects on some, but not all, benchmarks.

Running grep I observed ~8% decrease in binary with this patch. But clang sees little effect: <1%.

On a few different benchmarks from a benchmark suite, this also decreased runtime overhead by a significant amount: 8% for sha512, and 13% for qsort.

This actually has very significant effects on some, but not all, benchmarks.

Running grep I observed ~8% decrease in binary with this patch. But clang sees little effect: <1%.

On a few different benchmarks from a benchmark suite, this also decreased runtime overhead by a significant amount: 8% for sha512, and 13% for qsort.

This sounds worthwhile.

llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3940

I don't think this is entirely correct. Consider this:

entry { alloca 32; store 0..16 } -> A, B
A { store 16..32 } -> B
B { load }

traversing entry->A->B will mark B as done even though there is a path where control flow arrives at B with alloca not entirely initialized.

What happens to the benchmarks if this algorithm is limited to a single basic block?

guiand added inline comments.Jul 30 2020, 3:15 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3940

I tried to write the code so that it would look to see if *all* paths are initialized before a load. So in your example, it would look at the path entry -> A -> B, and it would see that it's fully stored, and it would continue looking at the next path. For entry -> B, if (!firstUsesAreStore) return false would come into effect, and we won't optimize away the poisoning.

guiand added inline comments.Jul 30 2020, 3:21 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3940

Oh, I see where the issue is here: the if TraversedSet.count check. I needed some way to prevent cycles in the graph, so that was the first thing I reached for, but you're right, this approach breaks the algorithm.

eugenis added inline comments.Jul 30 2020, 3:31 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3940

Right. And it does not even require a cycle.

This is a dataflow problem - assign labels to graph nodes showing which bytes are always initialized at entry, and update them based on predecessor's labels until the fixed point is reached. But in this case, I suspect, most of the optimization opportunities are within a single BB.

guiand added inline comments.Jul 30 2020, 3:47 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3940

I think this could be fixed by removing the target node from the set after the recursive call is finished. But I'll measure the impact of removing the inter-block stuff entirely to see how much value it adds.

guiand updated this revision to Diff 282099.Jul 30 2020, 6:40 PM

I've cut it down to only within a basic block as @eugenis and @vitalybuka suggested.

In general, this implementation looks pretty complex and easy to get wrong. I'd prefer something along the lines of AArch64StackTagging::collectInitializers - directly calculate the offset for each store/load instruction. It might do some extra work with unrelated memory instructions, but probably not too much.

How do you handle this case?

a = alloca
b = bitcast a
lifetime_start b
store b

When scanning from lifetime_start, this code will never encounter any direct use of a, and would miss the transitive use.

Missing diff context.

llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3895

prefer early exits / continue(s)

3915

This function is never called with StoreOffs != 0, which seems necessary to handle a chain of GEPs.

3920

Only if same BB.

guiand updated this revision to Diff 282317.Jul 31 2020, 2:43 PM
guiand marked 3 inline comments as done.

Flattened some control flow, updated to properly use StoreOffs, and updated tests to cover chained GEPs

In general, this implementation looks pretty complex and easy to get wrong. I'd prefer something along the lines of AArch64StackTagging::collectInitializers - directly calculate the offset for each store/load instruction. It might do some extra work with unrelated memory instructions, but probably not too much.

I'll take a look at collectInitializers. As for the current implementation -- yeah, I always figured there would be a better way. But I tried to be pretty conservative with how I implemented it, so while we might miss some stores, we should never "forget" to poison an alloca.

How do you handle this case?

a = alloca
b = bitcast a
lifetime_start b
store b

When scanning from lifetime_start, this code will never encounter any direct use of a, and would miss the transitive use.

The code currently scans from the alloca, rather than from the lifetime_start. This might make only searching in single BB pretty limiting, since afaict an alloca can be detached from its lifetime region.

The code currently scans from the alloca, rather than from the lifetime_start. This might make only searching in single BB pretty limiting, since afaict an alloca can be detached from its lifetime region.

Oh right. That's actually pretty weak, it's very common for lifetime to start in a basic block other than entry.

You want to start scanning at the same point where poisoning is going to be inserted.

It seems like collectInitializers leans heavily on the isPointerOffset function, which returns an offset if two pointers have a constant difference, nullopt if they don't. The problem here is that we can't distinguish isPointerOffset == nullopt happening because the offset is determined at runtime, or because the two pointers are completely unrelated.

It's a pretty big difference, because we don't want to poison a sequence like this:

%x = alloca [ i32, i32 }
%y = alloca i32
%z = load i32, i32* %y ; isPointerOffset == false
%x0 = getelementptr { i32, i32 }, { i32, i32 }* %x, i32 0, i32 0
%x1 = getelementptr { i32, i32 }, { i32, i32 }* %x, i32 0, i32 1
store i32 0, i32* %x0
store i32 0, i32* %x1

But we want to poison a sequence like this:

%x = alloca [ i32, i32 }
%y = getelementptr { i32, i32 }, { i32, i32 }* %x, i32 0, i32 %dynamic_offs
%z = load i32, i32* %y ; isPointerOffset == false
%x0 = getelementptr { i32, i32 }, { i32, i32 }* %x, i32 0, i32 0
%x1 = getelementptr { i32, i32 }, { i32, i32 }* %x, i32 0, i32 1
store i32 0, i32* %x0
store i32 0, i32* %x1

Right, that's what the isNoModRef check is for.

guiand updated this revision to Diff 282982.Aug 4 2020, 11:47 AM

Integrated with Alias Analyzer, uses simpler mechanism for walking through BB and determining stores to alloca

guiand added inline comments.Aug 4 2020, 11:53 AM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3848

No longer needed

3850

No longer needed

3871

It depends a lot on the program. Grep saw like a 8% decrease in binary size, while clang saw 0.5% to 1% decrease.

vitalybuka added inline comments.Aug 18 2020, 7:32 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3927

not sure why StoredBytes is a parameter and not just a local var in firstUsesAreStore

guiand added inline comments.Aug 22 2020, 12:32 PM
llvm/lib/Transforms/Instrumentation/MemorySanitizer.cpp
3927

True, that was leftover from a previous version of this patch.