This is an archive of the discontinued LLVM Phabricator instance.

ValueTracking: Use fixed array for assumption exclude set in Query; NFC
ClosedPublic

Authored by MatzeB on Jan 14 2016, 3:44 PM.

Details

Summary

After r251146, computeKnownBits() is called multiple times for every instruction in the program, which resulted in a 3% compile-time regression. This patch tries to get some of that compile time back by optimizing the function:

The Query structure is constructed often and is relevant for compile-time
performance. We can replace the SmallPtrSet for assumption exclusions in
this structure with a fixed-size array because we know the maximum
number of elements. This improves typical clang -O3 -emit-llvm compile time
by 1.2% in my measurements.

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB updated this revision to Diff 44939.Jan 14 2016, 3:44 PM
MatzeB retitled this revision from to ValueTracking: Use fixed array for assumption exclude set in Query; NFC.
MatzeB updated this object.
MatzeB added reviewers: hfinkel, majnemer, sanjoy.
MatzeB set the repository for this revision to rL LLVM.
MatzeB added a subscriber: llvm-commits.
reames added a subscriber: reames.Jan 14 2016, 4:42 PM

Given MaxDepth is less than the Small size specified for the SmallPtrSet, I'm not sure why this is faster. The only differences I can come up with between the new code and the old are a) the check for isSmall and b) the fact that the insert code is outlined for SmallPtrSet.

I'd really doubt that (a) is the cause. If it is (b), we should just fix the SmallPtrSet impl to inline the common case of the insertion logic.

Please try measuring with a tweaked version of SmallPtrSet. Unless you can show that the hand rolled code is still much faster, I'm hesitant to take this change.

> Please try measuring with a tweaked version of SmallPtrSet. Unless you can show that the hand rolled code is still much faster, I'm hesitant to take this change.

In the majority of the cases the set isn't used at all, I didn't see any performance differences when tweaking the insertion/query logic. The gains appear to come from the simplified initialization (and maybe reduced stack usage?).

> In the majority of the cases the set isn't used at all, I didn't see any performance differences when tweaking the insertion/query logic.

Wow, that was a fast turnaround. How'd you manage to build something after touching an ADT header that fast, much less performance test it? I want to know your secret. :)

I have to admit, this really surprises me.

> The gains appear to come from the simplified initialization (and maybe reduced stack usage?).

Can you back up this claim? I'm open to being convinced this is true, but it would imply that we're creating *a lot* of Query objects. Do you believe that to be true? If so, we could explore using a stack discipline for the exclusions rather than passing around copies of the full Query.

> Can you back up this claim? I'm open to being convinced this is true, but it would imply that we're creating *a lot* of Query objects. Do you believe that to be true? If so, we could explore using a stack discipline for the exclusions rather than passing around copies of the full Query.

I did not modify any ADT headers; I only experimented within ValueTracking.cpp. By tuning the insertion/query logic I meant working on the Query structure inside ValueTracking.cpp. If you read the code, you need an llvm.assume intrinsic or one specific comparison for the set to be used at all, so unsurprisingly that does not happen often.

computeKnownBits() is indeed called a lot, and if you compare the initialisation of a SmallPtrSet:

    : SmallArray(SmallStorage), CurArray(SmallStorage), CurArraySize(SmallSize)
    // ...
    if (!isSmall() && NumElements*4 < CurArraySize && CurArraySize > 32)
      return shrink_and_clear();

    // Fill the array with empty markers.
    memset(CurArray, -1, CurArraySize*sizeof(void*));
    NumElements = 0;
    NumTombstones = 0;

compared to:

    : NumExcluded(0)

You also get additional const void **SmallArray;, const void **CurArray;, unsigned CurArraySize;, unsigned NumElements;, unsigned NumTombstones; members for the SmallPtrSet.

I don't know if this is really the instructions or the stack-space saving, and I am hoping it's not just cache effects because the stack alignment changed. But I did take care to measure a release clang on a machine with most daemons killed and ASLR disabled, using 3 different longer testcases (combine.c from 403.gcc, the sqlite3 amalgamation, the WebKit JavaScript interpreter in a single file) run multiple times, so I am confident the measurements themselves are real.

> If so, we could explore using a stack discipline for the exclusions rather than passing around copies of the full Query.

I am not sure what you mean by Stack Discipline, can you elaborate (or give me a link to an explanation of the concept)?

> I'm open to being convinced this is true, but it would imply that we're creating *a lot* of Query objects. Do you believe that to be true?

I added a counter and got 686810 Query instances for my testcase that runs for 1.5s; I'm not sure if that is enough to expect a 1-2% performance difference...

> I did not modify any ADT headers, I only experimented within ValueTracking.cpp. With tuning the insertion/query logic I meant working on the Query structure inside ValueTracking.cpp, and if you read the code you need to have an llvm.assume intrinsic or 1 specific comparison for the set to be used so unsurprisingly that does not happen often.

Ok, I think we're talking past each other. I specifically asked you to run an experiment with a slightly altered implementation of SmallPtrSet to help isolate the cause of the performance improvement. I thought you had said you'd done so; now it sounds like you have not. I would like you to run that experiment or otherwise argue why the outlined insertion implementation isn't the primary cost you're saving here.

> You also get additional const void **SmallArray;, const void **CurArray;, unsigned CurArraySize;, unsigned NumElements;, unsigned NumTombstones; members for the SmallPtrSet.

In the small case, all of these should be zero initialized except for CurArray; this should reduce to a handful of stores. I'm not going to say it's impossible that such a minor difference matters, but I'm suspicious. You could convince me by adding a couple of dummy fields to the Query and showing that they have the performance impact you're claiming.

> But I did take care to measure a release clang, on a machine with most daemons killed, ASLR disabled, 3 different longer testcases (combine.c from 403.gcc, sqlite3 amalgamation, webkit javascript interpreter in a single file) run multiple times, so I am confident the measurements themselves are real.

This is good information. I'll believe you the difference is real; I'm just trying to understand why. Once we know why, we can decide if this is the right fix and whether there's other ways we could continue to improve the performance of this code.

> I am not sure what you mean by stack discipline, can you elaborate (or give me a link to an explanation of the concept)?

What I meant in this case is that we appear to be using a Query object to wrap an existing one with a single extra exclusion added. This is conceptually a stack of exclusions alongside a single unchanging Query. We could explicitly model it as a Query without exclusions plus an explicit exclusion stack that we push to and pop from on each recursive call. We could even introduce an ExclusionStackHelper to do the push/pop for us implicitly.

> Ok, I think we're talking past each other. I specifically asked you to run an experiment with a slightly altered implementation of SmallPtrSet to help isolate the cause of the performance improvement. I would like you to run that experiment or otherwise argue why the outlined insertion implementation isn't the primary cost you're saving here.

I will do that another time, as it is not relevant here: I added a counter and, as suspected, found zero queries or insertions into the set across all my testcases, so we nearly never touch it. All we have to worry about in the normal case is the construction and the size of the set.

> In the small case, all of these should be zero initialized except for CurArray. This should reduce to a handful of stores. I'm not going to say it's impossible that such a minor difference matters, but I'm suspicious. You could convince me by adding a couple of dummy fields to the Query and showing that they do have the performance change you're claiming they do.

You are right to be sceptical: I did some experiments and found that doing some memsetting, or adding additional members to the Query structure, has no measurable performance impact. However, adding an unused SmallPtrSet<void*, 8> Dummy; member does trigger the performance regression, so something more complicated appears to be going on here...

> This is good information. I'll believe you the difference is real; I'm just trying to understand why. Once we know why, we can decide if this is the right fix and whether there's other ways we could continue to improve the performance of this code.

> We could explicitly model this as a Query without exclusions and an explicit exclusion stack that we push/pop from on each recursive call. We could even introduce an ExclusionStackHelper to do the push/pop for us implicitly.

In fact, I even had the code changed to this. But as the code paths that actually create new Query objects are almost never hit (see above), I didn't see any performance difference.

sanjoy edited edge metadata.Jan 17 2016, 10:28 AM

Minor drive-by comments inline.

lib/Analysis/ValueTracking.cpp
102 ↗(On Diff #44939)

Why not std::array?

117 ↗(On Diff #44939)

This can be a call to find or any_of from STLExtras.h if you use an std::array for Excluded.

sanjoy added inline comments.Jan 19 2016, 12:18 PM
lib/Analysis/ValueTracking.cpp
117 ↗(On Diff #44939)

Sorry, just realized that it won't.

MatzeB updated this revision to Diff 46100.Jan 26 2016, 9:03 PM
MatzeB edited edge metadata.

Update patch to be more C++y with std::array and std::find.

Any chance to get this accepted on the grounds that it definitely does not slow down the code even if I can't explain why the improvements in my measurements are as big as they are?

reames accepted this revision.Jan 27 2016, 4:47 PM
reames added a reviewer: reames.

LGTM w/minor comments addressed.

lib/Analysis/ValueTracking.cpp
96 ↗(On Diff #46100)

Stale comment.

115 ↗(On Diff #46100)

Place this assert after the increment as an <=. They're equivalent, but the alternative code is easier to read.

121 ↗(On Diff #46100)

If you zero-initialize the array, you can simplify this by just scanning the extra elements. (optional)

Adding an early exit when NumExcluded == 0 is likely to be worthwhile.

This revision is now accepted and ready to land.Jan 27 2016, 4:47 PM
MatzeB marked 5 inline comments as done.Jan 27 2016, 10:32 PM

Thanks for the review!

lib/Analysis/ValueTracking.cpp
121 ↗(On Diff #46100)

Initializing the array is something I avoid on purpose here, because in the huge majority of cases the array is not used at all.

This revision was automatically updated to reflect the committed changes.