This is an archive of the discontinued LLVM Phabricator instance.

Reimplement heuristic for estimating complete-unroll optimization effects.
ClosedPublic

Authored by mzolotukhin on Apr 2 2015, 7:26 PM.

Details

Summary

This patch reimplements the heuristic that tries to estimate the optimization
benefits of complete loop unrolling.

In this patch I kept the changes minimal - e.g. I removed the code handling
branches and folding compares. That's a promising area, but there are
currently too many open questions to discuss before we can enable it.

Diff Detail

Event Timeline

mzolotukhin updated this revision to Diff 23200.Apr 2 2015, 7:26 PM
mzolotukhin retitled this revision from to Reimplement heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)Apr 2 2015, 7:26 PM
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).
chandlerc edited edge metadata.Apr 9 2015, 1:53 PM

Ok, full pass of comments below.

Generally, this is definitely looking better to me. I think there are still a number of things that could be simplified or refactored, but that can come later. The stuff below is just to try and get this iteration ready to go.

lib/Transforms/Scalar/LoopUnrollPass.cpp
255–256

Indentation seems off here.

258–262

No need to repeat the variable name.

I'd also call the variable "HaveSeenAR" for consistency with LLVM's naming conventions.

271–276

While you're here, can you make the argument be 'L' instead of 'loop'?

285–286

I somewhat dislike an assignment in a return expression. It's really hard to see when reading the code. Could you instead set the variable and then return? Or maybe use a function that aborts the traversal?

290

This is the only raw false here. Everything else that returns false sets IndexIsConstant to false first. If this is correct, could you comment on why?

365

SmallDenseMap?

411–413

Please don't use DenseMap like this. You're inserting a null value for every AddrOp you visit.

You want to use exactly the logic used above for the LHS and RHS expressions:

Value *AddrOp = I.getPointerOperand();
if (!isa<Constant>(AddrOp))
  if (Constant *SimplifiedAddrOp = SimplifiedValues.lookup(AddrOp))
    AddrOp = SimplifiedAddrOp;

Better yet, just hoist this entire pattern into a helper called 'simplifyValue' or some such?
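
E.g., a minimal sketch (just to illustrate; the map name follows the snippet above):

Value *simplifyValue(Value *V) {
  // Return the simplified form of V if we have one cached, otherwise V.
  if (!isa<Constant>(V))
    if (Value *Simplified = SimplifiedValues.lookup(V))
      return Simplified;
  return V;
}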

413–420

In what case is the base address null? Adding null base addr records to the cache doesn't really make sense to me.

414–416

Don't do two map lookups. Look up the key once (using find(), so a miss doesn't just give you a null value) and then test the iterator and use the iterator.
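
E.g. (illustrative sketch; SCEVGEPDescriptor is the struct from this patch, the other names are assumed):

auto CacheIt = SCEVCache.find(AddrOp);
if (CacheIt == SCEVCache.end())
  return false; // Nothing cached for this address.
// Reuse the iterator instead of doing a second lookup.
const SCEVGEPDescriptor &Desc = CacheIt->second;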

425–426

What ensures that this is always safe?

If something does ensure that this is always safe, I assume it's established when populating the structure. If that's the case, why can't we store only the unsigned variant?

428

What happens when "Step * Iteration" overflows? What about when "(Start + Step * Iteration)" overflows?

432–435

Under what circumstances is this not a constant?

If this is genuinely not a constant, should we really be considering the load "optimized"?

(Also, the cast<Value> shouldn't be needed?)

444–449

Would this second comment paragraph go better on the SCEV evaluation tool above? Or sunk into the implementation below? It seems kind of out of place here.

449

If the base address is a constant, the GEP will also fold away, so we should be able to mark it as optimized? (And we should be able to do this on each iteration.)

451

Shouldn't you rinse this through the simplification map?

452–453

You should probably comment specifically that we expect to re-visit the same instructions repeatedly (once per iteration) and so we only want to do iteration-independent SCEV queries and computations once. I'd also probably extract all of the SCEV computation stuff below into a separate member function that you can comment as being iteration independent, etc. Then you can structure the visit somewhat more naturally.

489–498

I find it really weird to count optimized instructions rather than counting instructions that will remain *after* optimizations.

502–508

Rather than setting UnrolledLoopSize to UINT_MAX below to signal some inability to reasonably compute the unrolled size estimation, why not return true or false here? If this returns false, we have no useful data about the loop. Move on. If this returns true, then you can query for the detailed numbers.

504–506

I think you should have a FIXME to eventually remove the max iteration count to analyze. Once we shake the bugs out of the algorithm, it shouldn't be necessary. We should be willing to analyze any number of iterations as long as the un-optimized resulting instruction count is below a threshold.

571–575

I would handle all of this below where you're actually dealing with percentages. Handling it here seems really surprising and hard to understand.

694–707

Can you explain some of your motivation for having the double threshold and percentage query? It seems really awkward to implement, so I'm curious what the need is here. If we could get away with just a flat threshold, it'd make me happy. =]

842–860

I may just be forgetting where this is handled, but do we somewhere short-circuit the case where the total size of the loop body * the trip count is already below the threshold? Because we should. We shouldn't go and do the expensive analysis unless we at least have a large enough loop to be on the fence.

mzolotukhin added inline comments.Apr 9 2015, 6:15 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
694–707

The idea is the following: currently we have a threshold for unrolling small loops (~200 instructions). What I want to add is the possibility to go beyond this threshold, but only if that gives a performance benefit. E.g. if the unrolled loop would be 500 instructions but 30% faster than the original loop, then we want to unroll it. But we do not want to unroll this loop if it would become only 5% faster (in terms of the cost of executed instructions). On the other hand, we don't want to unroll loops with huge trip counts, even if the resulting code seems to be faster. I.e. if unrolling would help to eliminate 50% of the instructions, but the trip count is 10^9, we definitely don't want to unroll it.

And several examples to illustrate the idea:
a)

int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

If we completely unroll the loop, we'll get something like:

t = b[0]*c[0];
a[0] = t * d[0];
t = b[1]*c[1];
a[1] = t * d[1];
...
t = b[499]*c[499];
a[499] = t * d[499];

which would be simplified to:

a[0] = 0; // b[0] == 0
a[1] = 0; // b[1] == 0
...
a[498] = 0;  // b[498] == 0
a[499] = c[499]*d[499]; //b[499] == 1

That is, unrolling helps to remove ~50% of the instructions in this case - and that's not about code size, it's about execution time, because in the original loop we have to execute every MUL instruction, since we don't know the exact value of b[i].

b)

/* The same example as before, but with a huge trip count. */
int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500000; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

We want to give up on this loop, because the unrolled version would be way too big. We might have some problems compiling it, and even if we compile it successfully, we might be hit hard by cache/memory effects.

c)

/* The same example as (a), but unrolling doesn't help to simplify anything. */
int b[] = {6,2,3...4,7}; // no 0 or 1 values
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}

We don't want to just start unrolling any loop with a higher trip count than we unrolled before if that doesn't promise any performance benefit.

So, to distinguish (a) and (b), we use 'AbsoluteThreshold'. To distinguish (a) and (c) we use percentage.
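
For concreteness, here is a sketch of how the two checks could combine (canUnrollCompletely and AbsoluteThreshold appear in this patch; the other names and the exact structure are illustrative):

bool canUnrollCompletely(unsigned UnrolledSize,
                         unsigned NumberOfOptimizedInstructions,
                         unsigned Threshold, unsigned AbsoluteThreshold,
                         unsigned MinPercentOfOptimized) {
  if (UnrolledSize <= Threshold)
    return true;  // Small loop: unroll as we always did.
  if (UnrolledSize > AbsoluteThreshold)
    return false; // Rules out huge trip counts, i.e. case (b).
  // In between, require a big enough win, which rules out case (c).
  uint64_t PercentOfOptimizedInstructions =
      (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;
  return PercentOfOptimizedInstructions >= MinPercentOfOptimized;
}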

mzolotukhin edited edge metadata.
  • Move check for possible NumberOfOptimizedInstructions*100 overflow into canUnrollCompletely.
  • Hoist computing SCEV expressions out of the main traversal loop. This step is semantically different from visiting instructions to check whether they could become redundant after unrolling: we need to do it only once, while the other visitors must run for every simulated iteration of the loop.
  • Make analyzeLoop return bool.
  • Don't run the analysis if we can unroll the loop even without it.
  • Make SCEVGEPDescriptor's fields Start and Step uint64_t (previously, they were APInt).
  • Compute Index in uint64_t and make sure operands fit into 32-bit int.
  • Other small changes.

Hi Chandler,
Please find my answers inline. The patch is updated correspondingly; is it OK to commit it?

lib/Transforms/Scalar/LoopUnrollPass.cpp
255–256

Fixed.

258–262

Fixed.

271–276

Fixed.

285–286

Fixed.

290

I rewrote the return statements in the function; now they use raw true/false values. I also return false instead of true in if (isa<SCEVConstant>(S)) - that doesn't matter, since we can't "follow" into the SCEVConstant anyway, but false is more consistent here.

365

Makes sense, fixed.

411–413

Thanks, fixed! Though I didn't add a separate function for it.

413–420

You are right, it can't be null (it's checked when we prepare a new entry for SCEVCache). Fixed.

414–416

Fixed.

425–426

Thanks, I added constraints on the operands. Also, we now store the unsigned variant instead of APInt, as you suggested.

428

I made Index, Step, and Start uint64_t, while the values in Start, Step, and Iteration can't exceed the 32-bit maximum. That should prevent overflows.
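
A sketch of the reasoning (assuming all three values are bounded by 2^32 - 1):

// Start, Step, Iteration < 2^32, so:
//   Step * Iteration         <= (2^32 - 1)^2               < 2^64
//   Start + Step * Iteration <= (2^32 - 1) + (2^32 - 1)^2  < 2^64
// Hence the uint64_t computation cannot wrap.
uint64_t Index = Start + Step * Iteration;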

432–435

Thanks, fixed!

449

We don't support such an optimization for GEPs (for now). We can add it later, and it'll naturally go into visitGetElementPtr, which is currently removed.

452–453

I think that we want to return to the original cacheSCEVResults approach. With that, we explicitly distinguish actions we want to do once (compute SCEVs and store the interesting ones) from actions we want to perform on every simulated loop iteration (like trying to optimize LoadInst/BinaryOp/etc.). So, I added cacheSCEVResults and removed visitGetElementPtr.
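
Roughly, the one-time step could look like this (a sketch, not the committed code; BaseAddress is assumed to be found by a separate visitor, and the 32-bit range checks on Start/Step are omitted for brevity):

void cacheSCEVResults(Value *Addr, Value *BaseAddress) {
  // Runs once per address: record only iteration-independent facts.
  const SCEV *S = SE.getSCEV(Addr);
  if (auto *AR = dyn_cast<SCEVAddRecExpr>(S))
    if (auto *StartC = dyn_cast<SCEVConstant>(AR->getStart()))
      if (auto *StepC = dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE)))
        SCEVCache[Addr] = {BaseAddress,
                           StartC->getValue()->getZExtValue(),
                           StepC->getValue()->getZExtValue()};
}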

489–498

I can change it, but in fact it doesn't sound so weird to me :)

502–508

That's a good idea, fixed according to your suggestion!

504–506

Is it possible that we can optimize all instructions in the loop body, and thus never reach the threshold? I think not, because we would have at least one instruction (the branch) in the loop body, but I'm not confident here - maybe there are some weird cases (e.g. the cost of the branch is 0). If it's guaranteed that the cost of the loop body is always > 0, then we can remove this limit.

571–575

Fixed!

842–860

Good point, fixed!

Rebase to trunk:

  • Reimplement heuristic for estimating complete-unroll optimization effects.
  • Address Chandler's remarks.
  • Address Chandler's remarks.
  • Address Chandler's remarks.
  • Fix merge fail.
  • Address Chandler's remarks.
  • Address Chandler's remarks.

Rebase on trunk.

  • Add a helper function lookupSimplifiedValue.
mzolotukhin updated this revision to Diff 25389.May 8 2015, 4:37 PM
  • Rebase on recent trunk.
  • Prevent precision loss in UnrolledSize.
  • Remove unused VisitedGEPs.
  • Fix a typo in comment.
chandlerc requested changes to this revision.May 11 2015, 3:28 PM
chandlerc edited edge metadata.

Whew! Back to this at last. Sorry for the huge delay.

Lots of comments below, but I've marked some as good candidates for follow-up patches.

Can you let me know if it makes sense for me to take a look at the DCE stuff or if we should focus on getting this one landed first?

lib/Transforms/Scalar/LoopUnrollPass.cpp
277

Please add a comment to this function explaining what it's trying to do.

333–334

Rather than all of this, you can just use a SmallSetVector<BasicBlock *, 16>. I would somewhat prefer that to the typedef...

339

Just use a named inner struct -- typedef-ing structs is only necessary in C.

341–342

While 64 bits should be enough for any common case, if the SCEV code has APInts I would just continue to use them here so that we have full fidelity.
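
I.e. something like (a sketch; the field names follow this thread):

struct SCEVGEPDescriptor {
  Value *BaseAddr;
  APInt Start; // Keep SCEV's full-width constants instead of uint64_t.
  APInt Step;
};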

381–389

The spacing and the mixture of doxygen and non-doxygen styles seem really messy here.

416

'd' isn't a very helpful name (aside from not matching the variable naming conventions).

418–420

Comment here to remind the reader that you're checking for the specific types of SCEVGEP loads that can be folded completely to a constant.

429–430

We at least need a comment or FIXME, as we shouldn't return false here. A load past the end of sequential constant data is an error, and so we should be free to fold it to nothing for the purpose of loop unroll cost estimation.

447–448

Why not a range-based loop here?

451–455

It feels like we could probably hoist some of this out of the loop? Feel free to just leave a FIXME and not deal with it in this patch.

460

Again, 'd' is a bad name here.

Actually, I don't know why you create the descriptor this early? It seems like this region of code could just use the base addr from the visitor.

477

Is this to prevent overflow of the offset computations later? If so, comment that please. If not, what is it for?

480–483

I feel like this could just be an insert call? Or if you'd rather, something like:

SCEVCache[V] = {Visitor.BaseAddress, StartAP.getLimitedValue(),
                StepAP.getLimitedValue()};

508–515

I feel like this should probably be a doxygen comment.

518–519

Why do we zero these here rather than in the constructor?

539–540

This seems vacuous due to the requirement of a terminator...

546–551

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

694–698

I would find it much clearer to just write the "percentage" check below in a way that wouldn't overflow:

uint64_t PercentOfOptimizedInstructions =
    (uint64_t)NumberOfOptimizedInstructions * 100ull / UnrolledSize;

702

I don't think the comment here is helping. I would just add an assert about it above, after the test.

855

Why do you need this? I'm surprised this isn't just directly using the UA's values?

This revision now requires changes to proceed.May 11 2015, 3:28 PM
mzolotukhin edited edge metadata.
  • Add a comment before follow().
  • Replace BBSetVector with SmallSetVector<BasicBlock *, 16>.
  • Doxygen-ify some comments.
  • Remove unnecessary variable 'SCEVGEPDescriptor d'.
  • Use a range-based loop.
  • Replace 'd' with a meaningful name.
  • Add some comments and FIXMEs.
  • Initialize NumberOfOptimizedInstructions and UnrolledLoopSize in constructor.
  • Rewrite expression to avoid overflow.
  • Remove no longer needed overflow check.
  • Replace a comment with an assert.
  • Use UA.UnrolledLoopSize instead of min(UA.UnrolledLoopSize, UnrolledSize).
  • Add FIXME for out-of-bound access handling.
  • Remove redundant if(BB->empty()) check.
  • Replace typedef with a named struct.
  • Use APInt instead of uint64_t.
  • Add sanity checks before accessing SCEV.

Hi Chandler,

Thanks for the comments! I believe I've addressed all of them in the source, except this one:

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

I kept counting the cost inside the visit function, because we might have three cases there:

  1. the instruction was simplified to a constant (e.g. x = a[0] * y becomes x = 0 * y = 0)
  2. the instruction was simplified, but not to a constant (e.g. x = a[0] + y becomes x = 0 + y = y)
  3. the instruction wasn't simplified

In case we want to distinguish (1) and (2) outside the visit function, a bool won't be enough. Currently we wouldn't lose much by merging them, but I didn't want to limit ourselves here from the very beginning.
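
For concreteness, a rough sketch of the three cases for a binary operator (illustrative only; simplifyValue is a hypothetical helper, and the cost accounting is schematic):

Value *LHS = simplifyValue(I.getOperand(0));
Value *RHS = simplifyValue(I.getOperand(1));
Value *SimpleV = SimplifyBinOp(I.getOpcode(), LHS, RHS, DL);
if (!SimpleV)
  return; // Case 3: not simplified; the instruction survives unrolling.
if (Constant *C = dyn_cast<Constant>(SimpleV))
  SimplifiedValues[&I] = C; // Case 1: folded to a constant.
// Cases 1 and 2: the instruction itself goes away either way.
NumberOfOptimizedInstructions += TTI.getUserCost(&I);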

chandlerc accepted this revision.May 12 2015, 9:11 AM
chandlerc edited edge metadata.

Generally, I think this can go in. There are a bunch of things I think should be cleaned up here, but they're fairly minor; I'm happy to just fix those myself and, for a few that have more impact, send you patches.

Hi Chandler,

Thanks for the comments! I believe I've addressed all of them in the source, except this one:

Is there a reason you don't make visit() return a bool indicating whether its cost should be counted or not, and localize all the counting in this function? It would be much easier to understand IMO.

I think I would also find it easier to read this as counting the number of instructions that will actually result from unrolling (essentially, the *un*optimized instructions) and the optimized instructions. You could still sum them and divide to compute the percentage, but it would make the threshold check not require subtraction. That could be done in a follow-up patch though.

I kept counting the cost inside the visit function, because we might have three cases there:

  1. the instruction was simplified to a constant (e.g. x = a[0] * y becomes x = 0 * y = 0)
  2. the instruction was simplified, but not to a constant (e.g. x = a[0] + y becomes x = 0 + y = y)
  3. the instruction wasn't simplified

In case we want to distinguish (1) and (2) outside the visit function, a bool won't be enough. Currently we wouldn't lose much by merging them, but I didn't want to limit ourselves here from the very beginning.

I don't think we need to distinguish between them. The key thing to realize is that if 'y' above were inside the loop body, we would already have accounted for its cost. The critical thing is whether we can fold 'x' away.

While perhaps we'll want more advanced heuristics, I would rather assume not and simplify the code accordingly until a real use case arrives. (YAGNI, essentially.)

If you agree, I'm happy to make this change.

The only specific change I'd like to request in a follow-up is to ensure that some of the test cases you mentioned in email exercising the percentage threshold etc. are actually checked in as test cases.

Thanks!

This revision is now accepted and ready to land.May 12 2015, 9:11 AM
This revision was automatically updated to reflect the committed changes.

Thanks, Chandler!

I've committed the patch and will follow up with the tests today.

There are a bunch of things I think should be cleaned up here, but they're fairly minor and I'm happy to just fix those and for a few that have more impact, send you patches.

Sure, go ahead!

Michael