
[ValueTracking] Interpret GEPs as a series of adds multiplied by the related scaling factor
ClosedPublic

Authored by qcolombet on Aug 21 2020, 12:36 PM.

Details

Summary

Prior to this patch, computeKnownBits would only try to deduce trailing zeros bits for getelementptrs.
This patch adds the logic to treat geps as a series of adds, each multiplied by the related scaling factor.

To do so, we refactored the logic in computeKnownBitsMul to expose an API that works on a pair of KnownBits. Then, we used it in conjunction with computeKnownBitsAddSub to compute the series of adds and muls when processing geps.

Thanks to this patch, using a gep or performing the address computation directly "by hand" (ptrtoint followed by adds and muls followed by inttoptr) yields the same computeKnownBits information.

Previously, the "by hand" approach would have given more information.

This is related to https://llvm.org/PR47241.
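To illustrate the gap being closed, here is a small self-contained Python model (a toy 8-bit width, not the LLVM API): it computes the exact known-zero/known-one masks for a gep-style offset `idx * 8 + 4` with a fully unknown index. A trailing-zeros-only analysis can only say "the low 2 bits are zero", while the full known-bits view additionally proves bit 2 is one (offset ≡ 4 mod 8).

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def exact_known_bits(values):
    """Exact known-zero / known-one masks over a set of possible values."""
    one = zero = MASK
    for v in values:
        one &= v            # a bit stays known-one only if set in every value
        zero &= ~v & MASK   # a bit stays known-zero only if clear in every value
    return zero, one

def trailing_zeros(zero_mask):
    """How many low bits are known to be zero."""
    n = 0
    while n < WIDTH and (zero_mask >> n) & 1:
        n += 1
    return n

# Offset of a hypothetical gep: an unknown index scaled by 8, plus a
# constant 4-byte struct-field offset.
offsets = {(idx * 8 + 4) & MASK for idx in range(1 << WIDTH)}
zero, one = exact_known_bits(offsets)
# Trailing-zeros-only view: 2 bits. Full view: low 3 bits are known 0b100.
```

Here the "by hand" add/mul chain and the gep describe the same offsets, so after this patch both forms should report the same masks.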

Diff Detail

Event Timeline


Add scalable vector support (well, really: abort on scalable vector types!)

qcolombet added inline comments.Aug 21 2020, 6:34 PM
llvm/lib/Analysis/ValueTracking.cpp
1401

The bug in instrprof-value-prof.c is that this code is only correct for arrays. For structures, the index is not based on the scaling factor.
Fixing that.

qcolombet updated this revision to Diff 287154.Aug 21 2020, 8:53 PM
  • Rework the logic so that TrailZ and AddrKnownBits share most of the logic
  • Fix the handling of structure accesses in the process.
qcolombet added inline comments.Aug 21 2020, 9:03 PM
llvm/test/Transforms/InstCombine/constant-fold-gep.ll
55

Note: The change in the input of the test is because the alignment was not consistent with the base pointer.
On this store the alignment cannot be 8, since the base pointer is aligned on 16 and the offset is 17 * 4 + 16 == 17 * 4 + 4 * 4 == 21 * 4 == 10 * 8 + 4, i.e., the remainder modulo 8 cannot be zero. This is more easily seen on the previous load: @Y + 16 is aligned to 8, then we add 4 to make @Y + 17, and yet we claimed the result was still aligned to 8.
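A quick sanity check of the comment's arithmetic (plain Python, following the displacement stated above):

```python
# Displacement used by the store, per the comment: 17 * 4 + 16 bytes.
offset = 17 * 4 + 16
assert offset == 17 * 4 + 4 * 4 == 21 * 4 == 10 * 8 + 4

# With a 16-aligned base pointer, the access sits offset % 8 bytes past
# an 8-byte boundary; 84 % 8 == 4, so the store can never be 8-aligned.
remainder = offset % 8
```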

115

Test case for the structure access bug found during pretesting.
With the offset-calculation bug in the previous version of the patch, we would have reported align 4.

nikic requested changes to this revision.Aug 21 2020, 11:46 PM

At least in the current implementation, the compile-time impact of this change is too large: http://llvm-compile-time-tracker.com/compare.php?from=12edd4b36475170d445ac93da34e4883f23a8361&to=1b942d71a1d7e7e55c91a2240eaaba617cc26d5a&stat=instructions

Possibly adding the early exit on unknown bits will help.

llvm/lib/Analysis/ValueTracking.cpp
1363

I don't understand the purpose of this variable and why we track both the precise known bits *and* the trailing zeros. In the one place that sets TrackAddr to false, can't you set the known bits to unknown?

1372

This should early exit if the AddrKnownBits are unknown, which would be the equivalent of the TrailZ==0 check.

This revision now requires changes to proceed.Aug 21 2020, 11:46 PM
aqjune added inline comments.Aug 22 2020, 1:44 AM
llvm/test/Transforms/InstCombine/constant-fold-gep.ll
55

I agree that the changes were necessary; otherwise the stores would always raise UB due to the alignment mismatch.
It seems that the last three stores below always raise UB as well: the size of @Y is 72 bytes, so those accesses are out-of-bounds.
Should we address these in this patch (by removing the three lines or expanding the size of the global variable), or is it okay to just leave them as they were?

Hi @nikic

Thanks for the review.

At least in the current implementation, the compile-time impact of this change is too large

This change makes more calls to computeKnownBits, so the compile-time impact is expected and I don't see how we can avoid any of the calls.

Do you have recommendations on how to limit the impact?

Cheers,
-Quentin

llvm/lib/Analysis/ValueTracking.cpp
1363

The trailing zeros may still be tracked while the AddrKnownBits are unknown. This happens when we hit scalable vectors.
I kept this variable to make sure we don't regress computeKnownBits in these cases.

Regarding the TrackAddr variable, I can turn that into checking whether AddrKnownBits is unknown; I was just not sure what the compile-time impact of doing this would be.

1372

That's not true because AddrKnownBits won't be tracked as soon as we hit a scalable vector whereas the TrailZ information can still be tracked.

llvm/test/Transforms/InstCombine/constant-fold-gep.ll
55

I think the idea here is that the test was checking whether the inbounds keyword was added.

Possibly adding the early exit on unknown bits will help.

I'll add that to see if it helps.

Use AddrKnownBits.isUnknown instead of having a dedicated bool variable.

qcolombet added inline comments.Aug 24 2020, 10:29 AM
llvm/lib/Analysis/ValueTracking.cpp
1428

@nikic FYI, here is where we stop tracking the full address but still compute the trailing zeros.

If we don't do that, we will lose some information compared to the previous implementation.
E.g., test/Analysis/ScalarEvolution/scalable-vector.ll will fail
:5:10: error: CHECK: expected string not found in input
; CHECK: --> (3 * sizeof(<vscale x 4 x i32>)) U: [0,-15) S: [-9223372036854775808,9223372036854775793)

^

<stdin>:4:2: note: scanning from here
--> (3 * sizeof(<vscale x 4 x i32>)) U: full-set S: full-set

Hi @nikic,

I've made the suggested changes, how do I kick a new compile time measurement?

Cheers,
-Quentin

nikic added a comment.Aug 25 2020, 1:05 PM

Here are the results for the new version: http://llvm-compile-time-tracker.com/compare.php?from=188f1ac301c5c6da6d2f5697952510fc39cbdd43&to=a43ce9e792896679e71314fc368a027a2c9ae544&stat=instructions

This is better than before, but still quite expensive for what it does.

llvm/lib/Analysis/ValueTracking.cpp
1428

Okay, I see. I think we should still be able to combine these with some special handling in this spot. Would something like this work?

  • Combine IndexBits using computeKnownBitsMul as usual.
  • For scalable vectors, the result is additionally multiplied by an unknown vscale. As an approximation, only keep the trailing zero bits of IndexBits in this case, and then continue as usual.
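That approximation can be sketched in plain Python (toy zero/one masks, not the LLVM KnownBits API): after combining IndexBits as usual, a multiplication by an unknown vscale >= 1 can change every bit except the guaranteed trailing zeros, since multiplying never removes low zero bits.

```python
def keep_only_trailing_zeros(zero, one, width):
    """Conservatively model `IndexBits * vscale` for an unknown vscale:
    only the trailing zero bits of IndexBits survive the multiply."""
    tz = 0
    while tz < width and (zero >> tz) & 1:
        tz += 1
    # Low `tz` bits remain known-zero; everything else becomes unknown.
    return (1 << tz) - 1, 0
```

After this step the result can be fed back into the usual add/mul combination, as the suggestion describes.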
1440

We should add an sextOrTrunc() method to KnownBits. We already have zextOrTrunc() and anyextOrTrunc().
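A hypothetical miniature of what such a `sextOrTrunc` would do, sketched on plain (zero, one) masks rather than the real KnownBits class:

```python
def sext_or_trunc(zero, one, from_width, to_width):
    """Truncate when narrowing; when widening, replicate the known
    sign bit into the new high bits (unknown sign -> unknown high bits)."""
    to_mask = (1 << to_width) - 1
    if to_width <= from_width:
        return zero & to_mask, one & to_mask
    ext = to_mask & ~((1 << from_width) - 1)   # the newly added high bits
    sign = 1 << (from_width - 1)
    if zero & sign:                            # sign bit known zero
        return zero | ext, one
    if one & sign:                             # sign bit known one
        return zero, one | ext
    return zero, one                           # sign unknown
```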

Hi @nikic,

Thanks for the updated numbers!

This is better than before, but still quite expensive for what it does.

What's an acceptable target for you?

The reason I am asking is that I fear we won't be able to shave much more off the implementation, while for us not having this information is pretty damaging. Put differently, the previous implementation lacks critical information compared to hand-written pointer arithmetic, which can lead to missed optimization opportunities and dramatic performance loss.

I agree that, in itself, this change doesn't seem worth the added compile time, but I think we should see it as an enabler for more optimizations. For instance, this change doesn't fix https://llvm.org/PR47241, but it is a step in that direction. In other words, it doesn't fix it because InstCombine didn't have this kind of information before and thus doesn't know how to act on it, but longer term, having this information would likely improve several optimizations.

Cheers,
-Quentin

llvm/lib/Analysis/ValueTracking.cpp
1428

That should be doable, though I think it will degrade compile time even more, given that I would expect calling computeKnownBitsMul to be more expensive than just doing the math on TrailZ.

Also, are you suggesting that we get rid of TrailZ altogether and just carry the trailing zero bits in AddrKnownBits?
If so, we would add back a bunch of computeKnownBits calls, since AddrKnownBits would no longer be unknown.

What do you think?

1440

Sure.

  • Add KnownBits::sextOrTrunc
qcolombet marked an inline comment as done.Aug 26 2020, 10:30 AM

Does this (and follow-up patches) improve any public/private benchmarks? Any data?

Does this (and follow-up patches) improve any public/private benchmarks? Any data?

Yes, it improves some private benchmarks, but to be fair, the optimizations that it helps are not public.

I haven't done any follow-up patches yet for the public issue.
I didn't dig, but for the in-tree targets, the one benefit I have seen so far from this commit alone is the improved alignment information, e.g., in test/Transforms/InstCombine/constant-fold-gep.ll. This may already help some optimizations, like vectorization or selecting aligned loads during instruction selection.

Cheers,
-Quentin

One thing we could do to help compile time is to add more caching to the Query API.
Right now it only caches llvm.assume AFAICT, whereas we could cache known bits per value. That's what we did in the Machine IR variant of computeKnownBits.

This would be a functional change though, as in theory not having a cache can give you more information. E.g., imagine we hit a value at depth 6 with a max depth of 6, and then later on we hit it again at depth 4. With a cache, that value would keep the information computed on the first hit, i.e., with 0 levels of analysis left, whereas with the current implementation the second call would get 2 levels' worth of analysis.
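The depth scenario above can be sketched in a few lines of Python (a toy recursion, not the real analysis): each node in a chain of left-shifts proves one more trailing zero, up to the remaining depth budget. A per-query cache that stores the first, depth-limited answer loses the extra precision a revisit with more budget would have had.

```python
MAX_DEPTH = 6

def make_shl_chain(n):
    """Build a chain of n `shl by 1` nodes ending in an unknown leaf."""
    node = None
    for _ in range(n):
        node = ("shl1", node)
    return node

def known_tz(value, depth, cache=None):
    """Trailing zeros provable for `value` starting at `depth`.
    Pass a dict as `cache` to emulate per-query caching."""
    if cache is not None and value in cache:
        return cache[value]
    if depth >= MAX_DEPTH or value is None:
        tz = 0                      # out of budget, or an unknown leaf
    else:
        tz = 1 + known_tz(value[1], depth + 1, cache)
    if cache is not None:
        cache[value] = tz
    return tz
```

Without a cache, querying the same chain first at depth 6 and then at depth 4 yields 0 then 2; with a cache, the second query returns the stale 0.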

Ping @nikic.

Let me know if the current compile time impact is acceptable or if I need to come up with a different strategy.

nikic added a comment.Sep 6 2020, 1:28 PM

One thing we could do to help compile time is maybe add more caching to the Query API.
Right now it only caches llvm.assume AFAICT, whereas we could cache known bits for values. That's what we did in the Machine IR variant of computeKnownBits.

To clarify, are you referring to caching within a single known bits query here (for the case where one instruction is recursively referenced multiple times), or across multiple queries? Not sure how useful the former would be, I don't have a good idea how often we do redundant known bits computations in one query. The latter has been tried at some point, but I think the conclusion was that properly invalidating that cache in InstCombine is too fragile.

Let me know if the current compile time impact is acceptable or if I need to come up with a different strategy.

I don't think the current impact here is acceptable: This patch has very little benefit for standard targets (we really don't care about alignment all that much), but is a pretty significant compile-time regression, about 1.5% for tramp4-3d. That's not a good tradeoff.

I do have an old experiment lying around that would make this more feasible: Moving alignment inferral out of InstCombine into a separate pass that only runs once shortly before codegen (https://github.com/llvm/llvm-project/commit/d95001f63a7fb57a3c08002e05a751106ec8e10c for the basic idea -- it was a geomean 1% improvement when I tested it). This reduces the number of known bits queries on GEPs a lot and would make us less susceptible to how expensive they are.

llvm/lib/Analysis/ValueTracking.cpp
1428

I don't think doing this would have a (negative) impact on compile-time, as it only affects the case where GEPs on scalable vectors are used (i.e. very rarely).

And yes, with this change it should be possible to get rid of TrailZ entirely and only use AddrKnownBits.

Hi @nikic,

Thanks for the comments.

To clarify, are you referring to caching within a single known bits query here (for the case where one instruction is recursively referenced multiple times), or across multiple queries?

I was referring to caching within a single known bits query.
For GISel it actually helped quite a lot, but anyhow, this is orthogonal.

This patch has very little benefit for standard targets (we really don't care about alignment all that much),

This actually helps with more than alignment. The idea is that you get ranges for some addresses, and on some targets this can yield a dramatic improvement, because some addressing modes are not supported for all ranges.

This reduces the number of known bits queries on GEPs a lot and would make us less susceptible to how expensive they are.

Although interesting, again, I am not really after alignment here but rather ranges. I.e., the full bit analysis is required.

Anyway, I'll rework the patch to have the ability to inject some target-specific code here (probably by plumbing TargetTransformInfo in here) and we'll see how this looks.

Cheers,
-Quentin

Hi @nikic,

I've made a proof of concept in D87342 on how we could have some pieces of computeKnownBits be target dependent.

Would that other approach be suitable?

Cheers,
-Quentin

qcolombet abandoned this revision.Sep 14 2020, 11:48 AM

Abandon in favor of D87342.

qcolombet reclaimed this revision.Oct 6 2020, 4:34 PM

Resurrecting this diff.
I need to rebase on top of the refactoring: D88937 and D88935.

Rebase patch
Add the reviewers from D87342

qcolombet updated this revision to Diff 299224.Oct 19 2020, 5:08 PM

Get rid of TrailZ and track everything with AddrKnownBits.

Hi @nikic, @spatel, @lebedev.ri, @RKSimon, @efriedma,

Following the conversation in D87342, this patch has been rebased and all the previous feedback has been addressed.

We're left with the compile time issue that @nikic brought up.

Any idea on how to tackle that?

Cheers,
-Quentin

Hi @nikic, @spatel, @lebedev.ri, @RKSimon, @efriedma,

Following the conversation in D87342, this patch has been rebased and all the previous feedback has been addressed.

We're left with the compile time issue that @nikic brought up.

Any idea on how to tackle that?

I have not stepped through this in detail, so not sure if it helps:

  1. It's been a couple of months, so we should confirm that the time impact is still the same.
  2. It was suggested earlier to trigger the analysis from a different/dedicated pass rather than instcombine. Is that not feasible?
  3. We're using the more expensive analysis on all GEPs. Could we use the existing analysis as the common case and limit the more expensive one in some way (for example, to only GEPs with N or more operands, inbounds, etc.)?

I have not stepped through this in detail, so not sure if it helps:

  1. It's been a couple of months, so we should confirm that the time impact is still the same.

Here's some new numbers: https://llvm-compile-time-tracker.com/compare.php?from=0c0fcea557e4a7cfd51216ad20aa67c82733ab52&to=2bf154f40abacf90d0adffaaec20b01f6bd06481&stat=instructions So this is now at 0.35% geomean, with 1.3% on tramp3d-v4. This is definitely much better than where this started.

  1. It was suggested earlier to trigger the analysis from a different/dedicated pass rather than instcombine. Is that not feasible?

Right. Most of the cost here comes from the fact that we compute alignment for all loads and stores something like 10 times (due to the many instcombine runs, which are likely to visit instructions multiple times), and those alignment calculations commonly compute GEP known bits. I understand that the actual use case for this patch is not the alignment calculation; that's just where this ends up being expensive. If we removed that from the equation, I expect the compile-time impact of this patch would be pretty minimal. I started some work on this, but I'm not sure when I'll be able to finish it.

A couple special cases might improve performance:

  1. Check for and skip zero array indexes, similar to the special case for zero struct indexes, since they show up frequently.
  2. Accumulate constant indexes separately from the variable ones: you can compute the sum of the constant indexes and call computeForAddSub once with the result, instead of using computeForAddSub on each index.
  3. Add a special case for computeForAddSub on a constant; the arithmetic is a bit simpler.
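Suggestion #2 can be sketched in plain Python (a toy model with hypothetical helper names): constant index*scale products are folded into one exact integer accumulator, variable indices contribute only the trailing zeros of their scale, and the two are combined once at the end instead of per index.

```python
def trailing_zeros(n, width=64):
    """Trailing zero bits of a known constant (width for n == 0)."""
    return (n & -n).bit_length() - 1 if n else width

def gep_known_offset(indices, width):
    """indices: list of (index, scale) pairs; an int index is a known
    constant, any other marker stands for an unknown value.
    Returns toy (known_zero, known_one) masks for the total offset."""
    mask = (1 << width) - 1
    const_acc = 0       # exact sum of the constant contributions
    var_tz = width      # trailing zeros guaranteed by the variable part
    has_var = False
    for index, scale in indices:
        if isinstance(index, int):
            const_acc = (const_acc + index * scale) & mask
        else:
            has_var = True
            var_tz = min(var_tz, trailing_zeros(scale, width))
    if not has_var:
        return ~const_acc & mask, const_acc   # fully known offset
    # The variable part is a multiple of 2**var_tz, so the low var_tz
    # bits of the total offset are exactly the low bits of const_acc.
    low = (1 << var_tz) - 1
    return ~const_acc & low, const_acc & low
```

The saving comes from the single combination step: a gep with many constant indices performs plain integer adds instead of repeated known-bits add/sub merges.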

Hi @nikic, @spatel

If we remove that from the equation, I expect the compile-time impact of this patch would be pretty minimal. I started some work on this, but I'm not sure when I'll be able to finish it.

That's an interesting observation. It would fit my use case to have two APIs: one that computes known bits for everything, and one that computes known bits for everything but only alignment for geps.

That said, from the user's standpoint, how would we decide which version to use? E.g., what would be the criteria for instcombine?

If we look at https://llvm.org/PR47241 for instance, instcombine would need the full range to be able to optimize that code, but maybe we don't care about this use case.
Similarly, if we look at the change in llvm/test/Transforms/InstCombine/constant-fold-address-space-pointer.ll, even if we are only interested in the alignment, the new computation can yield better results.

It's been a couple of months, so we should confirm that the time impact is still the same.

Thanks @nikic for the new measurements!
Looks like it's smaller, but the benefits may still not be worth it in general.

Let me know what you think.

It was suggested earlier to trigger the analysis from a different/dedicated pass rather than instcombine. Is that not feasible?

Repeating what I said earlier in that message: for my use case, that would certainly work. The one difficulty is that it needs to be exposed as a full-fledged version of compute known bits, because I don't know beforehand whether the expression dag I am looking at contains any gep (and in particular, the root may not be a gep). So we would have to somehow convey within compute known bits that we want to compute only TrailZ instead of the full information. I would rather we avoid this kind of plumbing, especially because it would create a precedent for new improvements in compute known bits, which, if we go down that road, is not very far from D87342.

We're using the more expensive analysis on all GEPs. Could we use the existing analysis as the common case and limit the more expensive some way (for example, only GEPs with N or more operands, inbounds, etc)?

I can give that a try and see how it impacts compile time. The cut-off would be somewhat arbitrary though.

Stay tuned.

Cheers,
-Quentin

Thanks @efriedma for the suggestions!

Let me try that before deciding if we want the precise vs. less precise approach.

qcolombet updated this revision to Diff 299495.Oct 20 2020, 3:42 PM
  • Skip zero indices
  • Accumulate constant indices with constant scaling factor in a separate variable to avoid calls to computeForAddSub

Did 1 & 2 of @eli.friedman's suggestions.

After a quick look, I didn't see how to do #3 and left it out. We may want to give it a try separately as it may be a general improvement to the known bits infrastructure.

@nikic, could you do a measurement of the latest patch, please?

Then we can decide if we want to pursue either the cut-off idea or somehow expose the precise vs. less precise gep analysis.

Oops, I tagged Eli's old account.
Tagging the new one: @efriedma

nikic added a comment.Oct 21 2020, 1:03 AM

New numbers: https://llvm-compile-time-tracker.com/compare.php?from=0c0fcea557e4a7cfd51216ad20aa67c82733ab52&to=723ef6ad7cc817f5b00605f58d559330abae29e2&stat=instructions Down to 0.1% geomean, without any large outliers. That's good enough for me :) Thank you @efriedma for those suggestions.

I'll review the actual implementation later today.

nikic accepted this revision.Oct 21 2020, 11:31 AM

LGTM. Thank you for your patience and continued work on this issue.

llvm/lib/Analysis/ValueTracking.cpp
1361–1364

As far as I can see, you could just rename LocalKnown to AddrKnownBits, rather than first computing into LocalKnown and then copying it.

Or, going one step further, you could directly populate Known, rather than populating AddrKnownBits and then copying it at the end.

1364

Nit: Cst -> Const. This seems like an unusual abbreviation.

1383

Move these down to the computeKnownBits call? Doesn't appear to be used in between.

llvm/test/Transforms/InstCombine/constant-fold-gep.ll
111

Typo: CHECK; instead of CHECK:. I would suggest to simply rerun update_test_checks.py instead.

This revision is now accepted and ready to land.Oct 21 2020, 11:31 AM
  • Use Known directly instead of using an intermediate AddrKnownBits variable
  • Change AccCstIndices into AccConstIndices
  • Move initialization code of IndexBits next to first use
  • Regenerate test case using update_test_checks.py
qcolombet marked 4 inline comments as done.Oct 21 2020, 12:18 PM

Thanks all for your help and @nikic in particular for doing the perf measurements and pushing for a better solution.
I think this shows that reviews work and produce better code quality!

Will push shortly.

llvm/lib/Analysis/ValueTracking.cpp
1361–1364

Good point, we don't need the intermediate variables anymore.

1364

Done!

1383

Good catch, we don't need that here anymore.
Moved it down.

llvm/test/Transforms/InstCombine/constant-fold-gep.ll
111

Oh wow!
Sorry about that and nice catch!

Ran the update script (when I originally wrote the test, the checks were not autogenerated.)

This revision was landed with ongoing or failed builds.Oct 21 2020, 3:07 PM
This revision was automatically updated to reflect the committed changes.
qcolombet marked 4 inline comments as done.