
[analyzer] Reorder the layout of MemRegion and cache by hand for optimal size
Accepted · Public

Authored by steakhal on Aug 20 2020, 8:19 AM.

Details

Summary

While working on the GDB pretty-printers for the analyzer classes,
I observed that the MemRegion class has an inefficient memory layout.

Every object that has a MemRegion as a subobject benefits from this change:
we save 8 bytes in each case.

As a downside, I had to replace the llvm::Optional, since that alone consumes 16 bytes (even if the wrapped object would fit in 8 bytes).

Before the patch: sizeof(MemRegion) == 48

(gdb) ptype /o class MemRegion
/* offset    |  size */  type = class clang::ento::MemRegion : public llvm::FoldingSetBase::Node {
                         private:
/*   16      |     4 */    const clang::ento::MemRegion::Kind kind;
/* XXX  4-byte hole  */
/*   24      |    24 */    class llvm::Optional<clang::ento::RegionOffset> [with T = clang::ento::RegionOffset] {
                             private:
/*   24      |    24 */        class llvm::optional_detail::OptionalStorage<T, true> [with T = T] {
                                 private:
/*   24      |    16 */            union {
/*                 1 */                char empty;
/*                16 */                T value;

                                       /* total size (bytes):   16 */
                                   };
/*   40      |     1 */            bool hasVal;

                                   /* total size (bytes):   24 */
                               } Storage;

                               /* total size (bytes):   24 */
                           } cachedOffset;

                           /* total size (bytes):   48 */
                         }

After the patch: sizeof(MemRegion) == 40

(gdb) ptype /o MemRegion
/* offset    |  size */  type = class clang::ento::MemRegion : public llvm::FoldingSetBase::Node {
                         private:
/*   16      |    16 */    class clang::ento::RegionOffset {
                             private:
/*   16      |     8 */        const clang::ento::MemRegion *R;
/*   24      |     8 */        int64_t Offset;
                             public:
                               static const int64_t Symbolic;

                               /* total size (bytes):   16 */
                           } cachedOffset;
/*   32      |     4 */    const clang::ento::MemRegion::Kind kind;
/*   36      |     1 */    bool hasCachedOffset;

                           /* total size (bytes):   40 */
                         }
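The effect can be reproduced with a small stand-alone mock (the struct names below are illustrative, not the real clang::ento types; the concrete numbers assume a typical 64-bit ABI, where std::optional stores a bool flag after its payload):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative mock of the payload: a pointer plus an offset, 16 bytes.
struct RegionOffset {
  const void *R;
  std::int64_t Offset;
};

// Before: a 4-byte kind followed by optional<RegionOffset>. The optional
// carries its own bool flag, padded out to a full 8-byte slot.
struct Before {
  int kind;                           // 4 bytes + 4 bytes padding
  std::optional<RegionOffset> cached; // 16-byte payload + flag + padding = 24
};

// After: store the payload directly and group the small members, so the
// kind and the hand-rolled flag share a single 8-byte tail slot.
struct After {
  RegionOffset cached; // 16 bytes
  int kind;            // 4 bytes
  bool hasCached;      // 1 byte + 3 bytes padding
};

static_assert(sizeof(After) < sizeof(Before), "reordering saves space");
```

On a typical x86-64 ABI this gives sizeof(Before) == 32 and sizeof(After) == 24, the same 8-byte saving the patch achieves for MemRegion.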

Diff Detail

Event Timeline

steakhal created this revision.Aug 20 2020, 8:19 AM
Herald added a project: Restricted Project. Aug 20 2020, 8:19 AM
steakhal requested review of this revision.Aug 20 2020, 8:19 AM
steakhal retitled this revision from [analyzer] Reorder the the layout of MemRegion and cache by hand for optimal size to [analyzer] Reorder the layout of MemRegion and cache by hand for optimal size.

MemRegion is a popular class to instantiate in the analyzer, so it looks good to me.
But unless you add a comment, someone will probably replace the offset with an optional as part of a refactoring.

steakhal added a comment.EditedAug 20 2020, 10:11 AM

> But unless you add a comment, someone will probably replace the offset with an optional as part of a refactoring.

Sure, I will leave a note.

NoQ added a comment.Aug 20 2020, 7:40 PM

Heh, nice! Did you try to measure the actual impact of this change on memory and/or performance?

I think it'd make perfect sense to keep the offset in a side map. We don't compute it for all regions, and for most regions it doesn't need to be computed *at all*.

In D86295#2229712, @NoQ wrote:
> Heh, nice! Did you try to measure the actual impact of this change on memory and/or performance?

Eh, I don't really know how to bench this.
Here is how I did it, in a nutshell:
I added STATISTIC counters to the MemRegion ctor, dtor, the beginning of getAsOffset, and the path that just returns the cached value.

I analyzed a fairly big (50k+ LOC) TU of the llvm repository (X86ISelLowering.cpp).
I was using the following command:

/home/myuser/git/llvm-project/build/bin/clang --analyze -Qunused-arguments -Xclang -analyzer-opt-analyze-headers -Xclang -analyzer-output=text -Xclang -analyzer-config -Xclang expand-macros=true -Xclang -analyzer-checker=core -Xclang -analyzer-config -Xclang aggressive-binary-operation-simplification=true  -x c++ --target=x86_64-linux-gnu -std=gnu++14 -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/myuser/git/llvm-project/build/lib/Target/X86 -I/home/myuser/git/llvm-project/llvm/lib/Target/X86 -I/home/myuser/git/llvm-project/build/include -I/home/myuser/git/llvm-project/llvm/include -fPIC -fvisibility-inlines-hidden -Wno-unused-parameter -Wno-missing-field-initializers -pedantic -Wno-long-long -Wno-maybe-uninitialized -Wno-noexcept-type -Wno-comment -fdiagnostics-color -fPIC -fno-exceptions -fno-rtti -std=c++14 -isystem /usr/include/c++/7 -isystem /usr/include/x86_64-linux-gnu/c++/7 -isystem /usr/include/c++/7/backward -isystem /usr/local/include -isystem /usr/include/x86_64-linux-gnu -isystem /usr/include -Xclang -analyzer-stats /home/myuser/git/llvm-project/llvm/lib/Target/X86/X86ISelLowering.cpp

The relevant part of the log is:

 1051489 MemRegion        - The # of MemRegion objects were alive at a time during analysis
48603499 MemRegion        - The # of times the MemRegion::getAsOffset was called and the result was already cached
49258796 MemRegion        - The # of times the MemRegion::getAsOffset was called
 1051489 MemRegion        - The # of MemRegion objects alive at the given moment
 1051489 MemRegion        - The # of MemRegion objects created during the analysis
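For reference, counters like these follow the usual LLVM STATISTIC pattern. The sketch below mimics it with a stand-in macro so it is self-contained; the counter and type names are made up for illustration, not taken from the patch:

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for LLVM's STATISTIC macro (the real one lives in
// llvm/ADT/Statistic.h and reports via -stats).
#define STATISTIC(NAME, DESC) static std::uint64_t NAME = 0

STATISTIC(NumGetAsOffsetCalls, "# of times getAsOffset was called");
STATISTIC(NumCacheHits, "# of calls answered from the cached value");

struct MemRegionMock {
  bool hasCachedOffset = false;
  std::int64_t cachedOffset = 0;

  std::int64_t getAsOffset() {
    ++NumGetAsOffsetCalls; // counts every call
    if (hasCachedOffset) {
      ++NumCacheHits;      // counts only the cached fast path
      return cachedOffset;
    }
    cachedOffset = 7;      // placeholder for the real offset computation
    hasCachedOffset = true;
    return cachedOffset;
  }
};
```

Comparing the two counters after a run gives exactly the hit ratio quoted in the log above.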

> I think it'd make perfect sense to keep the offset in a side map. We don't compute it for all regions, and for most regions it doesn't need to be computed *at all*.

It seems that at least with this configuration, MemRegion::getAsOffset was called every single time.

Please have a look at the patch file and the results - I might have messed something up :s

NoQ added a comment.Aug 21 2020, 11:58 PM
> In D86295#2229712, @NoQ wrote:
>> Heh, nice! Did you try to measure the actual impact of this change on memory and/or performance?
>
> Eh, I don't really know how to bench this.

I mean, like, you can measure the entire process with time or something like that. I believe @vsavchenko's docker thingy already knows how to do that.

> It seems that at least with this configuration the MemRegion::getAsOffset was called every single time.

What I'm trying to say is that for almost every region R it's a trivial O(1) operation that yields RegionOffset (R, +0). It is only non-trivial for non-base regions like FieldRegion or ElementRegion or, well, CXXBaseObjectRegion.
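A minimal sketch of the side-map idea, using std::unordered_map as a stand-in for llvm::DenseMap and invented Region/RegionOffset mocks: base regions are answered in O(1) with (R, +0) and never occupy any storage, so only non-trivial regions ever get a cache entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Illustrative mocks, not the real clang::ento types.
struct Region {
  bool isBase; // FieldRegion/ElementRegion/CXXBaseObjectRegion would be false
};

struct RegionOffset {
  const Region *R;
  std::int64_t Offset;
};

class OffsetCache {
  std::unordered_map<const Region *, RegionOffset> Cache;

public:
  RegionOffset getAsOffset(const Region *R) {
    if (R->isBase)
      return {R, 0}; // trivial case: no caching needed at all
    auto It = Cache.find(R);
    if (It != Cache.end())
      return It->second; // cached non-trivial offset
    RegionOffset Computed{R, 42}; // placeholder for the real offset walk
    Cache.emplace(R, Computed);
    return Computed;
  }

  std::size_t entries() const { return Cache.size(); }
};
```

With such a layout the per-object cachedOffset field disappears entirely, at the cost of one hash lookup for non-base regions.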

In D86295#2231760, @NoQ wrote:
> I mean, like, you can measure the entire process with time or something like that. I believe @vsavchenko's docker thingy already knows how to do that.

I tried to measure this with time, without much luck:

> What I'm trying to say is that for almost every region R it's a trivial O(1) operation that yields RegionOffset (R, +0). It is only non-trivial for non-base regions like FieldRegion or ElementRegion or, well, CXXBaseObjectRegion.

What should be my next step?


Can we see any improvements in CPU counters (Linux)?

# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command

I am hoping to see a decrease in L1-dcache-load-misses ...

In D86295#2231760, @NoQ wrote:
> I believe @vsavchenko's docker thingy already knows how to do that.

Yep, it sure does! Additionally, there is a benchmark subcommand that can chart memory consumption for measured projects.


Could you point me to some docs to help getting started? I haven't used this docker image.


I procrastinated on the docs 😅 I will add them in the near future.

Here is a short summary of how to do regression testing (check that all warnings are the same):

  1. cd clang/utils/analyzer
  2. Build the test system's docker image (I assume that docker is installed): ./SATest.py docker --build-image
  3. Check out the commit that you want to compare against: git checkout origin/master or git checkout HEAD^1
  4. Build LLVM in docker for Ubuntu (if you are on Linux you might not need it): ./SATest.py docker --clang-dir {PATH_WHERE_IT_WILL_BE_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- --build-llvm-only. I usually put --build-dir next to my native build directories, i.e., ${LLVM_DIR}/build/Docker
  5. Gather reference results: ./SATest.py docker --clang-dir {PATH_WHERE_LLVM_IS_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- build -j1 -r
  6. Check out the commit that you want to test: git checkout master or similar
  7. Build LLVM with your patch: ./SATest.py docker --clang-dir {PATH_WHERE_IT_WILL_BE_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- --build-llvm-only. If you choose the same --build-dir it will get compiled incrementally and save you a lot of time. TIP: you can choose another directory for --clang-dir so you don't need to rebuild the previous version if you have to re-run it.
  8. Gather new results and compare: ./SATest.py docker --clang-dir {PATH_WHERE_LLVM_IS_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- build -j1 --strictness 2. NOTE: the SATest.py docker ... -- build command accepts a comma-separated list of projects for the analysis, so you can choose which projects to test.

Benchmarking is very similar in its nature: run the original, run the modified version, compare:

  1. ./SATest.py docker --clang-dir {PATH_WHERE_LLVM_IS_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- benchmark -o old.csv
  2. ...
  3. ...
  4. ./SATest.py docker --clang-dir {PATH_WHERE_LLVM_IS_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- benchmark -o new.csv
  5. ./SATest.py docker --clang-dir {PATH_WHERE_LLVM_IS_INSTALLED} --build-dir {PATH_FOR_BUILD_DIR} -- benchmark compare --old old.csv --new new.csv -o comparison.png

Benchmarks can take a while (they run the analysis on the same projects multiple times), so it is recommended to use smaller projects. I will merge D83942, and after that you can change the commands this way: ./SATest.py docker ... benchmark --max-size small -o blabla.csv

If you want to see --help for a subcommand, drop docker and all the options and simply run ./SATest.py build --help or ./SATest.py benchmark --help.

I really like the patch, but have nothing to add to what other reviewers already said. Nice!

> Here is a short summary of how to do regression testing (check that all warnings are the same):

Oh, thanks for the detailed guide, I will run the experiment.

However, ./SATest.py docker --build-image seems broken.

E: Version '9.0.1-2.3~ubuntu1.18.04.1' for 'python3-pip' was not found
The command '/bin/sh -c apt-get update && apt-get install -y     git=1:2.17.1-1ubuntu0.7     gettext=0.19.8.1-6ubuntu0.3     python3=3.6.7-1~18.04     python3-pip=9.0.1-2.3~ubuntu1.18.04.1     cmake=3.17.3-0kitware1     ninja-build=1.8.2-1' returned a non-zero code: 100

In the Dockerfile it should be python3-pip=9.0.1-2.3~ubuntu1.18.04.2 instead.
Shouldn't we anchor it to a specific version of bionic in the FROM clause?

> I really like the patch, but have nothing to add to what other reviewers already said. Nice!

Thanks:)


Yep, I guess that is the cause. I'll take a look. Did you try it with this fast fix?

> Yep, I guess that is the cause. I'll take a look. Did you try it with this fast fix?

I tried, but it needs further fixes.

Currently, I have this:

diff --git a/clang/utils/analyzer/Dockerfile b/clang/utils/analyzer/Dockerfile
index f74ff8aa95c..7727d92f98f 100644
--- a/clang/utils/analyzer/Dockerfile
+++ b/clang/utils/analyzer/Dockerfile
@@ -1,4 +1,4 @@
-FROM ubuntu:bionic
+FROM ubuntu:bionic-20200526
 
 RUN apt-get update && apt-get install -y \
     apt-transport-https \
@@ -12,12 +12,13 @@ RUN wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/nul
 RUN apt-add-repository -y 'deb https://apt.kitware.com/ubuntu/ bionic main'
 
 # test system dependencies
+# TODO: should we really depend on the cmake version 3.18.2 here? I don't think so.
 RUN apt-get update && apt-get install -y \
     git=1:2.17.1-1ubuntu0.7 \
     gettext=0.19.8.1-6ubuntu0.3 \
     python3=3.6.7-1~18.04 \
-    python3-pip=9.0.1-2.3~ubuntu1.18.04.1 \
-    cmake=3.17.3-0kitware1 \
+    python3-pip=9.0.1-2.3~ubuntu1.18.04.2 \
+    cmake=3.18.2-0kitware1 \
     ninja-build=1.8.2-1
 
 # box2d dependencies

I'm not sure if it's the company firewall or something else.


And what is the error right now?

> And what is the error right now?

BTW, this sort of ping-pong should be done on a different forum, e.g. on the Static Analyzer Discord channel of LLVM. Just saying.
You already know what I think about this since we discussed that on the mailing list.
Don't take it wrong, I would love it.

I tried to run the benchmark on the small set of projects, but it would take an enormous amount of time to analyze all such projects 20 times each...
Dedicating a box for this is unfeasible for me.
So I stuck with analyzing only tmux with 20 iterations.
The results are not convincing enough, IMO.

ASDenysPetrov accepted this revision.EditedSep 4 2020, 8:39 AM

The change LGTM! If it really can speed up performance or improve memory organization, let's land this patch.

This revision is now accepted and ready to land.Sep 4 2020, 8:39 AM

Should I run more measurements?