Page MenuHomePhabricator

[asan] Support dynamic shadow address instrumentation
ClosedPublic

Authored by etienneb on Aug 10 2016, 8:35 AM.

Details

Summary

This patch is adding the support for a shadow memory with
dynamically allocated address range.

The compiler-rt needs to export a symbol containing the shadow
memory range.

This is required to support ASAN on windows 64-bits.

Event Timeline

etienneb updated this revision to Diff 67526.Aug 10 2016, 8:35 AM
etienneb retitled this revision from to [compiler-rt] Suport dynamic shadow address instrumentation.
etienneb updated this object.
etienneb added reviewers: rnk, kcc.
etienneb added subscribers: chrisha, llvm-commits.
etienneb updated this revision to Diff 67529.Aug 10 2016, 8:41 AM

more nits

etienneb updated this revision to Diff 67530.Aug 10 2016, 8:43 AM

moar nits

rnk added inline comments.Aug 10 2016, 9:53 AM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
464

Can we make a kDynamicShadowSentinel for this?

945

LLVM typically uses right leaning pointers and references.

1822

Why create a local alloca and then load from it? This might end up being really slow in -O0 if we reload the shadow base from the stack before every user memory access. If we know we're in the entry block, we should be able to skip the alloca and change this to:

Value *GlobalDynamicAddress = F.getParent()->getOrInsertGlobal(
    kAsanShadowMemoryDynamicAddress, IntptrTy);
LocalDynamicShadow = IRB.CreateLoad(GlobalDynamicAddress);

We can then skip the extra load in memToShadow and use LocalDynamicShadow directly.

etienneb updated this revision to Diff 67547.Aug 10 2016, 10:26 AM
etienneb marked 3 inline comments as done.

fix rnk@ comments

FYI: This should be landed with the patch I'm completing (compiler-rt changes to expose the shadow address).

lib/Transforms/Instrumentation/AddressSanitizer.cpp
1822

Turns out that it's more efficient to put it on the stack for -O1 and -O2.
The alloca-slot is promoted to register for the scope of the function.
This is more efficient than having a load from memory for every instrumented load/store.

Even for /O0, the code should be better.
For every instrumented load/store, the code is slighty better (loading from the stack instead of global address).
The only added cost is the required instructions to copy the shadow address on the stack.

rnk added inline comments.Aug 10 2016, 10:42 AM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
1822

I'm not proposing loading from the global on every shadow memory access, I'm saying you should load once from the global in the entry block and use that load everywhere, rather than spilling it to the stack immediately and reloading before every shadow check. With O1 and higher, mid-level optimizers will immediately do this transform for you, but it's simpler to emit the efficient code in the first place. I'm not sure what will happen at -O0 since the fast register allocator is pretty bad.

etienneb added inline comments.Aug 10 2016, 10:50 AM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
1822

As discussed offline, as soon as the entry block dominates all other basic block.... using the result from "load" in the entry block should be enough.

I'm gonna make a patch to try it.
thx rnk for the explanation.

kcc edited edge metadata.Aug 10 2016, 11:43 AM

Wait, wait, this utterly sucks.
I need to know that we've exhausted all other possibilities before reviewing this.
Did we?

etienneb retitled this revision from [compiler-rt] Suport dynamic shadow address instrumentation to [compiler-rt] Support dynamic shadow address instrumentation.Aug 10 2016, 11:53 AM
etienneb edited edge metadata.
etienneb marked 2 inline comments as done.Aug 10 2016, 12:43 PM
In D23354#511569, @kcc wrote:

Wait, wait, this utterly sucks.
I need to know that we've exhausted all other possibilities before reviewing this.
Did we?

Windows memory implementation is not compatible with a static shadow address space without using ugly hacks.

The shadow address spaces currently used for win64 is working on windows 10 but not on windows 7.
On different windows version, the address space size is not the same (8 TB vs 128 TB).
see: https://msdn.microsoft.com/en-ca/library/windows/desktop/aa366778(v=vs.85).aspx

When using a static address within the 8TB, there is no way to be sure the memory space is available.
The DLLs are loaded at random location. The high-entropy randomisation is activated and DLLs are loaded everywhere in the memory space.

see: https://msdn.microsoft.com/en-us/library/jj835761.aspx

/HIGHENTROPYVA
By default, the linker sets this option for 64-bit executable images.

I was able to make it work with a static shadow (within the 8TB limit), but executables need to turn off the high-entropy, which is a hack.
Turning off high-entropy is not recommended and even if it's working I can't guaranty it will always work. We observed that DLLs where
loaded in the lower address space.

After trying to find different way to hack the code to keep a static shadow, we decided to try the dynamic one and determine if it's working
and what is the extra cost added.

The only added cost is a load at every instrumented-function entry.
There is an added loaded and a register holding the shadow address is used;
which I believe won't make a huge difference.

Benchmark will come. I wanted to share this solution to start discussion.

rnk edited edge metadata.Aug 10 2016, 1:12 PM
In D23354#511569, @kcc wrote:

Wait, wait, this utterly sucks.
I need to know that we've exhausted all other possibilities before reviewing this.
Did we?

I'm pretty confident that Etienne has done a lot of research, and we have found that on Windows, DLLs are spread out widely across the 48-bit address space. We'll be lucky if its possible to satisfy a 16TB allocation, let alone at a fixed address.

On top of that, Windows is a hostile environment where malware and AVs will come and inject DLLs into your process at unpredictable addresses. They don't stay clumped in the upper half like on Linux. We will need dynamic shadow, one way or another, if anyone ever wants to run an asan-ified executable to a machine that they don't control with any degree of reliability.

Also, as he mentioned, having a dynamic shadow makes it easier to run the same ASan binary on pre-Win8.1 machines (44-bit address space) and Win8.1+ machines (48-bit address space). We could technically use one fixed offset for both address space layouts, but the size would be dynamic, so we would still need a lot of runtime complexity.

I'm also skeptical that one load in the prologue is really that expensive. Use-after-return does a conditional branch in the prologue, and it's on by default now. Safestack does more work than that in the prologue, and your team found it had a negligible impact on performance. Anyway, Ettiene said he'd do a benchmark soon, we can how it goes.

We really did spend a while thinking about this, and I really do think we want to bring back dynamic shadow mapping.

kcc added a comment.Aug 10 2016, 1:23 PM

All I've heard so far are very sad things, but they don't convince me.
I believe we have Microsoft folks in the community now. Please summon some of them to this discussion.

We must find a way to use static shadow.
My estimate is that dynamic shadow will cost us 10% in CPU and 5% in code size at the very least.
These are just guesses, so benchmark results are more than welcome.

Comparing this change to whatever we do in the prologue is not correct since for dynamic shadow we
a) steal a register for the entire function
b) complicate the address arithmetic for every memory access.

BTW, why do we have to support all older windows versions?
For me it would be fine to only support whatever the current version is and the future ones.
At least as the first step.

In D23354#511712, @kcc wrote:

All I've heard so far are very sad things, but they don't convince me.
I believe we have Microsoft folks in the community now. Please summon some of them to this discussion.

We must find a way to use static shadow.
My estimate is that dynamic shadow will cost us 10% in CPU and 5% in code size at the very least.
These are just guesses, so benchmark results are more than welcome.

Comparing this change to whatever we do in the prologue is not correct since for dynamic shadow we
a) steal a register for the entire function
b) complicate the address arithmetic for every memory access.

BTW, why do we have to support all older windows versions?
For me it would be fine to only support whatever the current version is and the future ones.
At least as the first step.

We're not trying to support all versions of Windows. But a significant portion of our user population is still on Win7/8/8.1, and not on Win10. We'd like to support them for as long as Chrome supports those platforms.

No matter what static address you choose, there's a non-zero chance that something else will be there. Especially since all sorts of things out of control (third party code) likes to inject itself into Chrome's address space. Since we're trying to grab 12.5% of the address space, that chance is actually pretty high.

In a lab it's fine for occasional failures to start due to address space collisions. On a users machine, not so much. SyzyASAN chose a dynamic shadow for this reason. We're gearing up to ship 64-bit ASAN to the canary Windows users, and shipping a product that can randomly fail to start is simply not going to happen.

If the ability to run with a non-dynamic shadow is preserved as an instrumentation time choice, why can't we support both worlds? If you really want to tweak out the last small slice of performance in a lab scenario then feel free to use a static shadow. For shipping to users where the ability to reliably start trumps performance, then use a dynamic shadow.

kcc added a comment.Aug 10 2016, 1:55 PM

We're not trying to support all versions of Windows. But a significant portion of our user population is still on Win7/8/8.1, and not on Win10. We'd like to support them for as long as Chrome supports those platforms.

Maybe that's not the right goal?
Win7/8/8.1 tend to have slower machines and giving them asan-ified binaries *might* be a bad idea anyway.

No matter what static address you choose, there's a non-zero chance that something else will be there. Especially since all sorts of things out of control (third party code) likes to inject itself into Chrome's address space. Since we're trying to grab 12.5% of the address space, that chance is actually pretty high.

I perfectly understand that, but I have no expertise to know if there is no way to workaround this.
This is why I am asking someone from Microsoft to help.

In a lab it's fine for occasional failures to start due to address space collisions. On a users machine, not so much. SyzyASAN chose a dynamic shadow for this reason. We're gearing up to ship 64-bit ASAN to the canary Windows users, and shipping a product that can randomly fail to start is simply not going to happen.

Of course.

If the ability to run with a non-dynamic shadow is preserved as an instrumentation time choice, why can't we support both worlds?

We can. I am not convinced we've done everything to avoid it.

If you really want to tweak out the last small slice of performance in a lab scenario then feel free to use a static shadow. For shipping to users where the ability to reliably start trumps performance, then use a dynamic shadow.

On the contrary. When testing Chrome in a lab, 10% of speed is "just" money.
On a user machine it might be a difference between tolerable and intolerable UI speed.
Same for binary size.

etienneb updated this revision to Diff 67608.Aug 10 2016, 2:35 PM
etienneb edited edge metadata.

address rnk@ comment

We performed benchmarking of static and dynamic instrumentation on windows and on linux 64-bits to compare speed and code size impact.

Benchmarks used for the analysis are:

  • Chromium release build,
  • Clang debug and release build,
  • and some executables of the CPU2006 suite.

On windows 64-bits instrumented executables, we are observing a reduction in code size of about 3%. This can be explained by the fact that the shadow address is kept in a register, loaded in each function prologue. Thus, subsequent load instructions are smaller because they are not using 8-bytes for a fixed constant.

The same gains are observed when using the dynamic instrumentation on linux 64-bits, even if the shadow address constant is able to fit in 4-bytes.

Using different levels of optimisation (O0, O1 and O2) on linux 64-bits shows that there is a code size increase of about 20% on clang debug builds, but not on release builds. This loss is not observed on CPU2006 executables. By looking manually at a few functions, we figured out that functions with lots of local variables are creating high register pressure and the shadow address is not kept in a register and it is spilled on the stack, leading to an increase in function size.

On the speed aspect we are observing a loss of about 3% on windows CPU2006 benchmarks. On the other benchmarks the speed variation is lost in the noise.

Assembly code of static instrumentation:

shr    $0x3,%rax
mov    %rax,0x70(%rbx)
cmpb   $0x0,0x7fff8000(%rax)        # bytes: 80 b8 00 80 ff 7f 00 (7-bytes)
jne    591a90 <main+0x1140>

Assembly code of dynamic instrumentation:

prologue:
  lea    0x301d40(%rip),%rax        # <__asan_shadow_memory_dynamic_address>
  mov    (%rax),%rdx                # %rdx contains shadow address

[...]
Instrumentation size:
  shr    $0x3,%rax
  mov    %rax,0x78(%rbx)
  cmpb   $0x0,(%rax,%rdx,1)         # bytes: 80 3c 10 00 (4-bytes, smaller)
  jne    58ecf9 <main+0x1139>

We can observed the gain in code size on the cmpb instruction.

Chromium code size benchmark results:

CHROMIUM

linux
       base      static   dynamic     S/B    D/B   D/S
bro    000311c6	001934c6 0018fb06   821%   814%  99%
flatc  000635a6	002269f6 00220f66   554%   549%  99%
chrome 04a49f06	110865d6 10ab2686   367%   359%  98%

windows
       base      static   dynamic   S/B    D/B   D/S
bro       6A000   100000    FE000   242%   240%  99%
flatc     D6000   2BE000   2B1000   328%   322%  98%
chrome  3E5B000 10FAD000 109FA000   436%   427%  98%
c_child 3A09000  F69A000  F184000   425%   416%  98%

On chromium benchmark, the code size is always slightly improving.

CPU2006 codesize benchmark results:

CPU2026              
                  base  static dynamic  S/B   D/B    D/S
windows              
/Od 470.lbm      21000   A3000  A3000   494%  494%  100%
    482.sphinx3  55000  15A000  157000  407%  404%   99%
    401.bzip2    2C000   EE000  EC000   541%  536%   99%
              
/O1 470.lbm      1D000   97000  97000   521%  521%  100%
    482.sphinx3  38000   F3000  F2000   434%  432%  100%
    401.bzip2    1E000   AA000  AA000   567%  567%  100%
              
/O2 470.lbm      1D000   97000  97000   521%  521%  100%
    482.sphinx3  3C000  109000  107000  442%  438%   99%
    401.bzip2    22000   BC000  BC000   553%  553%  100%

On CPU2006 benchmarks, the code size is always slightly improving or about the same.

Clang code size metrics:

CLANG (Release) linux
                  base    static   dynamic  S/B   D/B   D/S  
clang-4.0     01df0c9a  0666a30a  063cebea  342%  333%  97%
opt           00d48372  02d436aa  02c3fd3a  341%  333%  98%
clang-format  000f3582  0053526a  00518e0a  548%  536%  98%
llvm-link     001a34b2  00a4030a  00a04f1a  626%  612%  98%
clang-tidy    00f682fa  0387f9ba  03700e2a  367%  357%  97%

CLANG (Debug) linux
                  base    static   dynamic  S/B   D/B   D/S  
clang-4.0     043bbd30  04fa5b80  063cebea  118%  147%  125%
opt           01aab360  02648e80  02c3fd3a  144%  166%  116%
clang-format  0047aff0  0073a340  00518e0a  161%  114%  71%
llvm-link     007618f0  00b2e8f0  00a04f1a  151%  136%  90%
clang-tidy    026478d0  02d32dd0  03700e2a  118%  144%  122%

On clang-benchmark, the code size is improving with Release builds and decreasing with Debug builds.

Impressive measurement results, thank you.
Let's go forward, Vitaly, please help review this further.

lib/Transforms/Instrumentation/AddressSanitizer.cpp
1861

Why is this here?
It looks like it belongs to another change. If so, please make a separate CL.

etienneb added inline comments.Sep 13 2016, 10:52 AM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
1861

Since the code is using a dynamic shadow, I've found a flaky bug hidden below layers.
IIRC, there is a call to __asan_default_options at the beginning of __asan_init, BEFORE the shadow initialization.
To avoid the bug, users should add the appropriate attribute to avoid instrumentation of that function, which is error prone.

This is a way to avoid instrumenting these functions.
I'm not proud of that fix, and willing to other proposition.
Like, maybe only banning a fixed set.

kcc added inline comments.Sep 13 2016, 2:26 PM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
1861

Sure, but please make it a separate change.
I am totally fine to just ignore all the functions that start with __asan_

vitalybuka added inline comments.Sep 14 2016, 10:18 PM
lib/Transforms/Instrumentation/AddressSanitizer.cpp
1007

I is not null, so just if (LocalDynamicShadow == I)

etienneb updated this revision to Diff 71518.Sep 15 2016, 9:47 AM
etienneb marked 4 inline comments as done.

rebased + cleanup

etienneb updated this revision to Diff 71522.Sep 15 2016, 10:33 AM

fix incorrect merge

vitalybuka accepted this revision.Sep 15 2016, 11:11 AM
vitalybuka edited edge metadata.

Please replace in title [compiler-rt] -> [asan]
LGTM

This revision is now accepted and ready to land.Sep 15 2016, 11:11 AM
etienneb retitled this revision from [compiler-rt] Support dynamic shadow address instrumentation to [asan] Support dynamic shadow address instrumentation.Sep 15 2016, 11:12 AM
etienneb edited edge metadata.

thanks, could you also review this one: https://reviews.llvm.org/D23363
Both patches are needed.

etienneb closed this revision.Sep 19 2016, 9:07 AM
etienneb reopened this revision.Sep 30 2016, 9:00 AM

Re-opening this issue.

[compiler-rt] r282096 - revert 282085, 281909, they broke 32-bit dynamic ASan and the sanitizer-windows bot

I'm gonna fix it and re-land it.

This revision is now accepted and ready to land.Sep 30 2016, 9:00 AM

When testing with this patch on the Mac, I ran into a problem using -asan-force-dynamic-shadow. Specifically, Mapping.OrShadowOffset was not inferred correctly because it relies on checking kDynamicShadowSentinel and not the ClForceDynamicShadow. How about using ClForceDynamicShadow when setting the Mapping.Offset and gating the maybeInsertDynamicShadowAtFunctionEntry solely on the value of the Mapping.Offset?

@@ -434,13 +434,19 @@ static ShadowMapping getShadowMapping(Triple &TargetTriple, int LongSize,
...
     else
       Mapping.Offset = kDefaultShadowOffset64;
   }
 
+  if (ClForceDynamicShadow) {
+    Mapping.Offset = kDynamicShadowSentinel;
+  }
+
   Mapping.Scale = kDefaultShadowScale;
   if (ClMappingScale.getNumOccurrences() > 0) {
     Mapping.Scale = ClMappingScale;
@@ -1802,7 +1808,7 @@ bool AddressSanitizer::maybeInsertAsanInitAtFunctionEntry(Function &F) {
 
 void AddressSanitizer::maybeInsertDynamicShadowAtFunctionEntry(Function &F) {
   // Generate code only when dynamic addressing is needed.
-  if (!ClForceDynamicShadow && Mapping.Offset != kDynamicShadowSentinel)
+  if (Mapping.Offset != kDynamicShadowSentinel)
     return;
 
   IRBuilder<> IRB(&F.front().front());

If you'd like, I can commit this as a separate patch.

etienneb updated this revision to Diff 73074.Sep 30 2016, 9:42 AM

apply anna's patch.

Thanks Anna. I modified this patch to include your changes.

If you have some cycles, can you run the unittests with both patches on iOS.
I can't test on that platform.

Tests are fine on windows 32/64 and linux 64.

When testing with this patch on the Mac, I ran into a problem using -asan-force-dynamic-shadow. Specifically, Mapping.OrShadowOffset was not inferred correctly because it relies on checking kDynamicShadowSentinel and not the ClForceDynamicShadow. How about using ClForceDynamicShadow when setting the Mapping.Offset and gating the maybeInsertDynamicShadowAtFunctionEntry solely on the value of the Mapping.Offset?

@@ -434,13 +434,19 @@ static ShadowMapping getShadowMapping(Triple &TargetTriple, int LongSize,
...
     else
       Mapping.Offset = kDefaultShadowOffset64;
   }
 
+  if (ClForceDynamicShadow) {
+    Mapping.Offset = kDynamicShadowSentinel;
+  }
+
   Mapping.Scale = kDefaultShadowScale;
   if (ClMappingScale.getNumOccurrences() > 0) {
     Mapping.Scale = ClMappingScale;
@@ -1802,7 +1808,7 @@ bool AddressSanitizer::maybeInsertAsanInitAtFunctionEntry(Function &F) {
 
 void AddressSanitizer::maybeInsertDynamicShadowAtFunctionEntry(Function &F) {
   // Generate code only when dynamic addressing is needed.
-  if (!ClForceDynamicShadow && Mapping.Offset != kDynamicShadowSentinel)
+  if (Mapping.Offset != kDynamicShadowSentinel)
     return;
 
   IRBuilder<> IRB(&F.front().front());

If you'd like, I can commit this as a separate patch.

etienneb updated this revision to Diff 73096.Sep 30 2016, 10:54 AM

fix rnk@ comments

etienneb closed this revision.Sep 30 2016, 10:55 AM

I've already tested this on iOS and Mac after applying additional patches to compiler-rt to get the FindAvailableMemoryRange working. I did not run into any regressions. Hope the commits stick this time around!