This is an archive of the discontinued LLVM Phabricator instance.

[scudo] Don't track free/use stats for transfer batches.
ClosedPublic

Authored by hctim on Apr 30 2021, 11:01 AM.

Details

Summary

The Scudo C unit tests are currently non-hermetic. In particular, adding
or removing a transfer batch is a global state of the allocator that
persists between tests. This can cause flakiness in
ScudoWrappersCTest.MallInfo, because the creation or teardown of a batch
causes mallinfo's uordblks or fordblks to move up or down by the size of
a transfer batch on malloc/free.

It's my opinion that uordblks and fordblks should track the statistics
related to the user's malloc() and free() usage, and not the state of
the internal allocator structures. Thus, excluding the transfer batches
from stat collection does the trick and makes these tests pass.

Repro instructions of the bug:

  1. ninja ./projects/compiler-rt/lib/scudo/standalone/tests/ScudoCUnitTest-x86_64-Test
  2. ./projects/compiler-rt/lib/scudo/standalone/tests/ScudoCUnitTest-x86_64-Test --gtest_filter=ScudoWrappersCTest.MallInfo

Diff Detail

Event Timeline

hctim created this revision.Apr 30 2021, 11:01 AM
hctim requested review of this revision.Apr 30 2021, 11:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2021, 11:01 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
cryptoad accepted this revision.Apr 30 2021, 11:07 AM

Right, it makes sense to not account for the batches. I don't think this should have a meaningful impact, but have you noticed any perf changes adding this since it's in the fastpath?

This revision is now accepted and ready to land.Apr 30 2021, 11:07 AM

Right, it makes sense to not account for the batches. I don't think this should have a meaningful impact, but have you noticed any perf changes adding this since it's in the fastpath?

Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
BM_malloc_free<scudo::AndroidConfig>/8                               -0.0330         -0.0329            62            60            62            60
BM_malloc_free<scudo::AndroidConfig>/64                              +0.0290         +0.0290            57            59            57            59
BM_malloc_free<scudo::AndroidConfig>/512                             +0.0286         +0.0286            58            59            58            59
BM_malloc_free<scudo::AndroidConfig>/4096                            +0.0287         +0.0288            58            59            58            59
BM_malloc_free<scudo::AndroidConfig>/32768                           -0.0022         -0.0023            61            61            61            61
BM_malloc_free<scudo::AndroidConfig>/131072                          +0.0078         +0.0079           209           211           209           211
BM_malloc_free<scudo::AndroidSvelteConfig>/8                         +0.0371         +0.0370            58            60            58            60
BM_malloc_free<scudo::AndroidSvelteConfig>/64                        +0.0386         +0.0386            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/512                       +0.0382         +0.0381            57            59            57            59
BM_malloc_free<scudo::AndroidSvelteConfig>/4096                      +0.0379         +0.0379            57            59            57            59
BM_malloc_free<scudo::AndroidSvelteConfig>/32768                     +0.0042         +0.0043         12378         12430         12377         12430
BM_malloc_free<scudo::AndroidSvelteConfig>/131072                    +0.0055         +0.0054         43211         43447         43211         43445
BM_malloc_free<scudo::FuchsiaConfig>/8                               +0.0085         +0.0085            60            61            60            61
BM_malloc_free<scudo::FuchsiaConfig>/64                              +0.0090         +0.0090            59            59            59            59
BM_malloc_free<scudo::FuchsiaConfig>/512                             +0.0083         +0.0083            59            60            59            60
BM_malloc_free<scudo::FuchsiaConfig>/4096                            +0.0085         +0.0085            59            60            59            60
BM_malloc_free<scudo::FuchsiaConfig>/32768                           +0.0273         +0.0271            61            63            61            62
BM_malloc_free<scudo::FuchsiaConfig>/131072                          +0.0306         +0.0307           129           133           129           133
BM_malloc_free_loop<scudo::AndroidConfig>/8                          -0.0053         -0.0053           475           472           475           472
BM_malloc_free_loop<scudo::AndroidConfig>/64                         +0.0039         +0.0039          4344          4361          4344          4361
BM_malloc_free_loop<scudo::AndroidConfig>/512                        +0.0009         +0.0010         40413         40448         40405         40445
BM_malloc_free_loop<scudo::AndroidConfig>/4096                       -0.0016         -0.0015        341311        340752        341237        340721
BM_malloc_free_loop<scudo::AndroidConfig>/32768                      +0.0020         +0.0020       3740279       3747592       3739572       3746891
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/8                    +0.0096         +0.0097           469           473           468           473
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/64                   +0.0003         +0.0003         54077         54091         54071         54088
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/512                  -0.0030         -0.0030        546428        544806        546389        544774
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/4096                 -0.0040         -0.0042       4849805       4830306       4849735       4829560
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/32768                +0.0046         +0.0046      58455483      58726574      58450885      58718062
BM_malloc_free_loop<scudo::FuchsiaConfig>/8                          +0.0075         +0.0075           474           477           474           477
BM_malloc_free_loop<scudo::FuchsiaConfig>/64                         -0.0002         -0.0002          4787          4786          4786          4785
BM_malloc_free_loop<scudo::FuchsiaConfig>/512                        +0.0082         +0.0083         46483         46863         46477         46862
BM_malloc_free_loop<scudo::FuchsiaConfig>/4096                       +0.0078         +0.0080        393932        397009        393828        396963
BM_malloc_free_loop<scudo::FuchsiaConfig>/32768                      -0.0071         -0.0071       4275339       4244975       4275055       4244679

Looks well within the noise, at least on Linux workstation, which is what I'd expect for a immediate vs. register conditional branch.

hctim added a comment.Apr 30 2021, 4:47 PM

I originally thought the +X wasn't a %% difference. I looked into one of the worst-looking benchmarks that showed a 3% increase.

Looks like LIKELY() macros doesn't make a difference here.

old vs. new

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.0000          0.0000      U Test, Repetitions: 50 vs 50
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  +0.0518         +0.0519            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                +0.0385         +0.0385            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                +0.9418         +0.9467             1             1             1             1

old vs. new with LIKELY()

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.0000          0.0000      U Test, Repetitions: 50 vs 50
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  +0.0377         +0.0378            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                +0.0383         +0.0383            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                +0.0897         +0.0903             1             1             1             1

new with LIKELY() vs. old (litmus check)

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.0000          0.0000      U Test, Repetitions: 50 vs 50
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  -0.0139         -0.0139            58            58            58            58
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                -0.0369         -0.0369            59            56            59            56
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                +9.2978         +9.2962             0             3             0             3

new vs. new with LIKELY()

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.1285          0.1217      U Test, Repetitions: 50 vs 50
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  -0.0014         -0.0014            59            58            59            58
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                -0.0002         -0.0001            59            59            59            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                -0.8267         -0.8264             1             0             1             0

I rewrote the patch to make it branchless, to similar results:

diff --git a/compiler-rt/lib/scudo/standalone/local_cache.h b/compiler-rt/lib/scudo/standalone/local_cache.h
index 509221f4d619..1b0f07fc84a1 100644
--- a/compiler-rt/lib/scudo/standalone/local_cache.h
+++ b/compiler-rt/lib/scudo/standalone/local_cache.h
@@ -67,6 +67,17 @@ template <class SizeClassAllocator> struct SizeClassAllocatorLocalCache {
       S->unlink(&Stats);
   }
 
+  void* allocateTransferBatch() {
+    PerClass *C = &PerClassArray[BatchClassId];
+    if (C->Count == 0) {
+      if (UNLIKELY(!refill(C, BatchClassId)))
+        return nullptr;
+      DCHECK_GT(C->Count, 0);
+    }
+    CompactPtrT CompactP = C->Chunks[--C->Count];
+    return Allocator->decompactPtr(BatchClassId, CompactP);
+  }
+
   void *allocate(uptr ClassId) {
     DCHECK_LT(ClassId, NumClasses);
     PerClass *C = &PerClassArray[ClassId];
@@ -85,6 +96,14 @@ template <class SizeClassAllocator> struct SizeClassAllocatorLocalCache {
     return Allocator->decompactPtr(ClassId, CompactP);
   }
 
+  void deallocateBatch(void *P) {
+    PerClass *C = &PerClassArray[BatchClassId];
+    if (C->Count == C->MaxCount)
+      drain(C, BatchClassId);
+    C->Chunks[C->Count++] =
+        Allocator->compactPtr(BatchClassId, reinterpret_cast<uptr>(P));
+  }
+
   void deallocate(uptr ClassId, void *P) {
     CHECK_LT(ClassId, NumClasses);
     PerClass *C = &PerClassArray[ClassId];
@@ -123,7 +142,7 @@ template <class SizeClassAllocator> struct SizeClassAllocatorLocalCache {
 
   TransferBatch *createBatch(uptr ClassId, void *B) {
     if (ClassId != BatchClassId)
-      B = allocate(BatchClassId);
+      B = allocateTransferBatch();
     return reinterpret_cast<TransferBatch *>(B);
   }
 
@@ -160,7 +179,7 @@ private:
 
   void destroyBatch(uptr ClassId, void *B) {
     if (ClassId != BatchClassId)
-      deallocate(BatchClassId, B);
+      deallocateBatch(B);
   }
 
   NOINLINE bool refill(PerClass *C, uptr ClassId) {
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  +0.0386         +0.0385            56            58            56            58
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                +0.0386         +0.0386            56            58            56            58
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                -0.1326         -0.1297             0             0             0             0

Then, I just added a single nop and saw the same results:

diff --git a/compiler-rt/lib/scudo/standalone/local_cache.h b/compiler-rt/lib/scudo/standalone/local_cache.h
index 509221f4d619..0b7f0e579443 100644
--- a/compiler-rt/lib/scudo/standalone/local_cache.h
+++ b/compiler-rt/lib/scudo/standalone/local_cache.h
@@ -80,6 +80,7 @@ template <class SizeClassAllocator> struct SizeClassAllocatorLocalCache {
     // the memory accesses in close quarters.
     const uptr ClassSize = C->ClassSize;
     CompactPtrT CompactP = C->Chunks[--C->Count];
+    asm volatile ("nop\n");
     Stats.add(StatAllocated, ClassSize);
     Stats.sub(StatFree, ClassSize);
     return Allocator->decompactPtr(ClassId, CompactP);
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                 0.0000          0.0000      U Test, Repetitions: 20 vs 20
BM_malloc_free<scudo::AndroidSvelteConfig>/64_mean                  +0.0375         +0.0375            57            59            57            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                +0.0384         +0.0383            56            59            56            59
BM_malloc_free<scudo::AndroidSvelteConfig>/64_stddev                -0.0177         -0.0188             1             1             1             1
hctim updated this revision to Diff 342097.Apr 30 2021, 5:40 PM

Instead of introducing a branch to the fast path, just undo the stats changes in the slow path. allocate()/deallocate() are inlined everywhere by the compiler (which is a correct optimisation, no inlining causes a 10% performance penalty). By adding extra instructions, we drastically diminish icache performance by a few percent.

So, instead, just undo the stats changes in the slowpath for the batch classes. Ugly, but performant.

hctim updated this revision to Diff 342099.Apr 30 2021, 5:54 PM

Looks much better now:

BM_malloc_free<scudo::AndroidConfig>/8_pvalue                                0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/8_median                               +0.0006         +0.0006            59            59            59            59
BM_malloc_free<scudo::AndroidConfig>/64_pvalue                               0.2730          0.2730      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/64_median                              -0.0001         -0.0001            57            57            57            57
BM_malloc_free<scudo::AndroidConfig>/512_pvalue                              0.0890          0.0539      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/512_median                             +0.0001         +0.0001            58            58            58            58
BM_malloc_free<scudo::AndroidConfig>/4096_pvalue                             0.3847          0.5205      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/4096_median                            +0.0001         +0.0001            58            58            58            58
BM_malloc_free<scudo::AndroidConfig>/32768_pvalue                            0.2123          0.2123      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/32768_median                           +0.0006         +0.0006            61            61            61            61
BM_malloc_free<scudo::AndroidConfig>/131072_pvalue                           0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidConfig>/131072_median                          -0.0275         -0.0275           226           220           226           220
BM_malloc_free<scudo::AndroidSvelteConfig>/8_pvalue                          0.0091          0.0028      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/8_median                         -0.0002         -0.0003            58            58            58            58
BM_malloc_free<scudo::AndroidSvelteConfig>/64_pvalue                         0.7337          0.1212      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/64_median                        -0.0000         -0.0000            56            56            56            56
BM_malloc_free<scudo::AndroidSvelteConfig>/512_pvalue                        0.0091          0.0257      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/512_median                       +0.0001         +0.0001            57            57            57            57
BM_malloc_free<scudo::AndroidSvelteConfig>/4096_pvalue                       0.6232          0.8501      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/4096_median                      +0.0001         +0.0000            57            57            57            57
BM_malloc_free<scudo::AndroidSvelteConfig>/32768_pvalue                      0.0312          0.0312      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/32768_median                     -0.0057         -0.0057         12239         12170         12239         12169
BM_malloc_free<scudo::AndroidSvelteConfig>/131072_pvalue                     0.0073          0.0073      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::AndroidSvelteConfig>/131072_median                    -0.0047         -0.0046         42651         42451         42644         42446
BM_malloc_free<scudo::FuchsiaConfig>/8_pvalue                                0.0257          0.0211      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/8_median                               -0.0001         -0.0001            60            60            60            60
BM_malloc_free<scudo::FuchsiaConfig>/64_pvalue                               0.0890          0.1405      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/64_median                              +0.0001         +0.0001            59            59            59            59
BM_malloc_free<scudo::FuchsiaConfig>/512_pvalue                              0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/512_median                             -0.0003         -0.0003            59            59            59            59
BM_malloc_free<scudo::FuchsiaConfig>/4096_pvalue                             0.2413          0.1405      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/4096_median                            -0.0001         -0.0002            59            59            59            59
BM_malloc_free<scudo::FuchsiaConfig>/32768_pvalue                            0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/32768_median                           -0.0474         -0.0474            64            61            64            61
BM_malloc_free<scudo::FuchsiaConfig>/131072_pvalue                           0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free<scudo::FuchsiaConfig>/131072_median                          -0.1056         -0.1056           145           130           145           130
BM_malloc_free_loop<scudo::AndroidConfig>/8_pvalue                           0.0312          0.0257      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidConfig>/8_median                          +0.0003         +0.0002           474           475           474           475
BM_malloc_free_loop<scudo::AndroidConfig>/64_pvalue                          0.6232          0.6232      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidConfig>/64_median                         -0.0007         -0.0007          4350          4347          4350          4347
BM_malloc_free_loop<scudo::AndroidConfig>/512_pvalue                         0.0022          0.0022      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidConfig>/512_median                        +0.0131         +0.0131         40291         40818         40288         40815
BM_malloc_free_loop<scudo::AndroidConfig>/4096_pvalue                        0.0022          0.0022      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidConfig>/4096_median                       +0.0117         +0.0117        339975        343953        339938        343904
BM_malloc_free_loop<scudo::AndroidConfig>/32768_pvalue                       0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidConfig>/32768_median                      +0.0352         +0.0353       3688635       3818622       3688268       3818358
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/8_pvalue                     0.0113          0.0113      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/8_median                    +0.0005         +0.0005           468           469           468           469
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/64_pvalue                    0.0890          0.0890      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/64_median                   -0.0036         -0.0036         53396         53202         53393         53199
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/512_pvalue                   0.0211          0.0211      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/512_median                  +0.0019         +0.0019        544097        545154        544064        545107
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/4096_pvalue                  0.0006          0.0003      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/4096_median                 -0.0047         -0.0046       4833793       4810864       4832688       4810374
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/32768_pvalue                 0.0058          0.0058      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::AndroidSvelteConfig>/32768_median                +0.0028         +0.0027      59100441      59263696      59097808      59257831
BM_malloc_free_loop<scudo::FuchsiaConfig>/8_pvalue                           0.2413          0.1620      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::FuchsiaConfig>/8_median                          +0.0000         +0.0000           473           473           472           473
BM_malloc_free_loop<scudo::FuchsiaConfig>/64_pvalue                          0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::FuchsiaConfig>/64_median                         +0.0054         +0.0055          4771          4796          4770          4796
BM_malloc_free_loop<scudo::FuchsiaConfig>/512_pvalue                         0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::FuchsiaConfig>/512_median                        +0.0049         +0.0048         46510         46737         46508         46733
BM_malloc_free_loop<scudo::FuchsiaConfig>/4096_pvalue                        0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::FuchsiaConfig>/4096_median                       +0.0079         +0.0077        392717        395811        392685        395726
BM_malloc_free_loop<scudo::FuchsiaConfig>/32768_pvalue                       0.0002          0.0002      U Test, Repetitions: 10 vs 10
BM_malloc_free_loop<scudo::FuchsiaConfig>/32768_median                      +0.0091         +0.0091       4243219       4282005       4242325       4280959

ClassSize is only used for the stats in the local cache, what about initializing it to 0 for BatchClassId in initCache?

hctim updated this revision to Diff 342474.May 3 2021, 11:15 AM

Integrate Kostya's excellent suggestion.

cryptoad accepted this revision.May 3 2021, 11:17 AM
This revision was landed with ongoing or failed builds.May 3 2021, 11:50 AM
This revision was automatically updated to reflect the committed changes.