This is an archive of the discontinued LLVM Phabricator instance.

[UnrolledInstAnalyzer] Use MSSA to find stored values outside of loop.
Needs ReviewPublic

Authored by fhahn on Sep 16 2019, 5:17 AM.

Details

Summary

In UnrolledInstAnalyzer, we already simplify the addresses of loads
inside the loop to a base pointer and offset. We can use this
information to check if we can find a store outside the loop that is a
def for the load, if unrolled.o

To find potentially defining stores, this patch uses MemorySSA. I am not
entirely sure if it is a good idea to rely on MemorySSA here already.
@asbirlea, what do you think?

If it makes sense to use MemorySSA here, I'd make sure we preserve it
during loop unrolling. I guess we would have to make sure it gets
preserved until the next MemorySSA user in the pipeline, to negate the
compile time impact.

Event Timeline

fhahn created this revision.Sep 16 2019, 5:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 16 2019, 5:17 AM

Here are a couple of answers/discussion points:

First of, could you please elaborate a bit more on what you're trying to check for? Is it stores before the loop, after the loop or both?
Normally we'd use the walker on the instruction (MSSA->getWalker()->getClobberingMemoryAccess(&I)) to look before the loop. Ideally we'd avoid going into the innards of MSSA in loop passes, so if a usecase not covered by the walker is needed that should be added either to a walker or to the updater.
It looks like a good use of MemorySSA, but the search should happen behind an API.

Secondly, in the new pass manager it is necessary for MSSA to be preserved by all loop passes in the same loop pipeline, because analyses are not actually invalidated until the entire loop pipeline is done. Hence it's possible to end up using an invalidated analysis and only notice this on a crash, miscompile or non-deterministic behavior. This occurred in D65060 and I tried to add a safety net in D66376 to prevent this from happening again (LPM2 will not use MSSA even if LoopUnroll is taught to preserve it, until all other passes in LPM2 all preserve it).

As far as preserving MemorySSA in LoopUnroll, this is a complex task. I had started work in this, but I did not complete it because unrolling does a lot of cloning/copying, and adding instructions "out of nowhere" to MSSA can lead to complex updates. It may be worth evaluating if updating MSSA is worth it for the unroll pass, or if it's faster to rebuild it from scratch. I did not get a chance to make this evaluation.

Here are a couple of answers/discussion points:

First of, could you please elaborate a bit more on what you're trying to check for? Is it stores before the loop, after the loop or both?

The idea here is to check stores before the loop and use the store to determine the loaded value in a specific loop iteration. We only do that if we can determine a base pointer + constant offset for a certain loop iteration.

Normally we'd use the walker on the instruction (MSSA->getWalker()->getClobberingMemoryAccess(&I)) to look before the loop. Ideally we'd avoid going into the innards of MSSA in loop passes, so if a usecase not covered by the walker is needed that should be added either to a walker or to the updater.
It looks like a good use of MemorySSA, but the search should happen behind an API.

Right, but the walker just does a single step, which might not be the defining instructions we are looking for. Would the recommended way to implement this to implement a new walker, which just looks for a MemDef with said base pointer + offset?

Secondly, in the new pass manager it is necessary for MSSA to be preserved by all loop passes in the same loop pipeline, because analyses are not actually invalidated until the entire loop pipeline is done. Hence it's possible to end up using an invalidated analysis and only notice this on a crash, miscompile or non-deterministic behavior. This occurred in D65060 and I tried to add a safety net in D66376 to prevent this from happening again (LPM2 will not use MSSA even if LoopUnroll is taught to preserve it, until all other passes in LPM2 all preserve it).

As far as preserving MemorySSA in LoopUnroll, this is a complex task. I had started work in this, but I did not complete it because unrolling does a lot of cloning/copying, and adding instructions "out of nowhere" to MSSA can lead to complex updates. It may be worth evaluating if updating MSSA is worth it for the unroll pass, or if it's faster to rebuild it from scratch. I did not get a chance to make this evaluation.

Right, I expect this to be a fair bit of work, but I'd be happy to look into it as pre-requisite for this patch.

Here are a couple of answers/discussion points:

First of, could you please elaborate a bit more on what you're trying to check for? Is it stores before the loop, after the loop or both?

The idea here is to check stores before the loop and use the store to determine the loaded value in a specific loop iteration. We only do that if we can determine a base pointer + constant offset for a certain loop iteration.

This is the part I do not understand. It looks like the LoadInst is considered "free" and not marked towards the cost of the unroll if there is a store with the same base pointer + offset.
Is that always the case? If there is an interfering MemoryAccess between the two (the walker's answer), then the LoadInst is not free.
I'm not that familiar with the code here, so feel free to correct me if I misunderstood.

Normally we'd use the walker on the instruction (MSSA->getWalker()->getClobberingMemoryAccess(&I)) to look before the loop. Ideally we'd avoid going into the innards of MSSA in loop passes, so if a usecase not covered by the walker is needed that should be added either to a walker or to the updater.
It looks like a good use of MemorySSA, but the search should happen behind an API.

Right, but the walker just does a single step, which might not be the defining instructions we are looking for. Would the recommended way to implement this to implement a new walker, which just looks for a MemDef with said base pointer + offset?

Yes, if this is your goal, having a new walker do this would be the way to go.

Secondly, in the new pass manager it is necessary for MSSA to be preserved by all loop passes in the same loop pipeline, because analyses are not actually invalidated until the entire loop pipeline is done. Hence it's possible to end up using an invalidated analysis and only notice this on a crash, miscompile or non-deterministic behavior. This occurred in D65060 and I tried to add a safety net in D66376 to prevent this from happening again (LPM2 will not use MSSA even if LoopUnroll is taught to preserve it, until all other passes in LPM2 all preserve it).

As far as preserving MemorySSA in LoopUnroll, this is a complex task. I had started work in this, but I did not complete it because unrolling does a lot of cloning/copying, and adding instructions "out of nowhere" to MSSA can lead to complex updates. It may be worth evaluating if updating MSSA is worth it for the unroll pass, or if it's faster to rebuild it from scratch. I did not get a chance to make this evaluation.

Right, I expect this to be a fair bit of work, but I'd be happy to look into it as pre-requisite for this patch.

The idea here is to check stores before the loop and use the store to determine the loaded value in a specific loop iteration. We only do that if we can determine a base pointer + constant offset for a certain loop iteration.

This is the part I do not understand. It looks like the LoadInst is considered "free" and not marked towards the cost of the unroll if there is a store with the same base pointer + offset.
Is that always the case? If there is an interfering MemoryAccess between the two (the walker's answer), then the LoadInst is not free.
I'm not that familiar with the code here, so feel free to correct me if I misunderstood.

It is probably more me not being as familiar with the MemorySSA API. I thought def_chain() would include all memory defs, that could clobber the load, in reverse order. With that in mind, if we hit a non-store or cannot compute the base pointer & offset, we exit. We probably also need another check to make sure we do not miss base pointers that alias. Does that make sense?

The idea here is to check stores before the loop and use the store to determine the loaded value in a specific loop iteration. We only do that if we can determine a base pointer + constant offset for a certain loop iteration.

This is the part I do not understand. It looks like the LoadInst is considered "free" and not marked towards the cost of the unroll if there is a store with the same base pointer + offset.
Is that always the case? If there is an interfering MemoryAccess between the two (the walker's answer), then the LoadInst is not free.
I'm not that familiar with the code here, so feel free to correct me if I misunderstood.

It is probably more me not being as familiar with the MemorySSA API. I thought def_chain() would include all memory defs, that could clobber the load, in reverse order. With that in mind, if we hit a non-store or cannot compute the base pointer & offset, we exit. We probably also need another check to make sure we do not miss base pointers that alias. Does that make sense?

In other words, I was looking for a convenience method abstracts that walking as a range iterator.

Would something like this work for your purpose?

if (MSSA) {
    auto *D = MSSA->getWalker()->getClobberingMemoryAccess(&I);
    if (MemoryDef *MD = dyn_cast<MemoryDef>(D))
        if (StoreInst *SI = dyn_cast<StoreInst>(MD->getMemoryInst())) 
            if (Optional<SimplifiedAddress> MaybeSimpAddr =
                simplifyNonLoopAddress(SI->getPointerOperand()))
                if (MaybeSimpAddr->Base == AddrBase &&
                    MaybeSimpAddr->Offset == SimplifiedAddrOp)
                    return true;
}

I don't know if you want to return false; on some of those branches or continue with the checks below.

The walker will stop at the first access that may alias the Load. If it's not a Def, it's a Phi and it should give up, similarly if it's a Def but not a Store. Then, only if it's a store and has the conditions you added, the load can be folded/cost is ignored.

fhahn added a comment.Sep 16 2019, 1:21 PM

Would something like this work for your purpose?

if (MSSA) {
    auto *D = MSSA->getWalker()->getClobberingMemoryAccess(&I);
    if (MemoryDef *MD = dyn_cast<MemoryDef>(D))
        if (StoreInst *SI = dyn_cast<StoreInst>(MD->getMemoryInst())) 
            if (Optional<SimplifiedAddress> MaybeSimpAddr =
                simplifyNonLoopAddress(SI->getPointerOperand()))
                if (MaybeSimpAddr->Base == AddrBase &&
                    MaybeSimpAddr->Offset == SimplifiedAddrOp)
                    return true;
}

I don't know if you want to return false; on some of those branches or continue with the checks below.

The walker will stop at the first access that may alias the Load. If it's not a Def, it's a Phi and it should give up, similarly if it's a Def but not a Store. Then, only if it's a store and has the conditions you added, the load can be folded/cost is ignored.

Yep, but that lets us only look at a single def, right? The load in the loop will be clobbered by multiple store, so the first one returned might not be the one we are interested in. Because we look at the load in a certain iteration, we may be able to simplify the pointer to a base pointer + constant offset. We should be able to skip MemDefs, if we can prove they store to different offsets for the same base pointer.

MSSA should have already done this verification. The walker will skip over stores that do not clobber the load or other MemoryAccess-es that do not interfere with the load.
So if there is a Store with the same pointer and different offset, the aliasing result for the two should be that they don't alias, and the walker will continue until it finds an access that does clobber the load.

Perhaps worth mentioning that the walker also does phi translation when walking defs upward.

I think I see the gap in communication here.

If you ask MSSA for the closest definition for a load inside a loop, it will return the closest store which might alias the load at any iteration. This is what you want if you want to, for example, eliminate a load using a stored value.

What we really want here, though, is not MSSA for the load, as it exists in the loop. Instead the question is, what would MSSA return after the loop is unrolled? (It might not return the same def: alias analysis has more information after unrolling.) We could speculatively unroll the loop and recompute MSSA, but that would be really expensive. Instead, the code here is walking through the defs to try to find the def that will be relevant after unrolling.

fhahn added a comment.Sep 25 2019, 1:57 PM

I think I see the gap in communication here.

If you ask MSSA for the closest definition for a load inside a loop, it will return the closest store which might alias the load at any iteration. This is what you want if you want to, for example, eliminate a load using a stored value.

What we really want here, though, is not MSSA for the load, as it exists in the loop. Instead the question is, what would MSSA return after the loop is unrolled? (It might not return the same def: alias analysis has more information after unrolling.) We could speculatively unroll the loop and recompute MSSA, but that would be really expensive. Instead, the code here is walking through the defs to try to find the def that will be relevant after unrolling.

Thanks for clarifying Eli! @asbirlea, does this make more sense now?

For a bit more context: UnrolledInstAnalyzer simulates unrolling the loop and for each iteration it tries to estimate what can be simplified in that iteration. So when it analyzes the first iteration, it knows what the concrete value of the induction variable will be. If we can simplify an address computation to a base address + offset using that information, we look at clobbering stores outside the loop of the original load, and see if we can find a store that stores exactly to the same location. Note that memorySSA does not have this additional information and the load in the loop will be clobbered by any store that involves an address covered by the loop.

The underlying motivation to do that is that such store/load pairs can be eliminated after unrolling and should not count towards the size of the unrolled loop.

fhahn added a comment.Dec 12 2019, 1:51 PM

ping

@asbirlea do the additional comments by Eli and myself make things clearer (sorry for just coming back to this now)

Sorry, I forgot about this one, thank you for the reminder!

Yes, the explanation makes sense to me.

I would go with adding the logic for how to walk the defs into a Walker.
There are quite a few things that need to be changed in the above code using the def_chain.
e.g.:
There are no MemoryUses, so that check is not needed.
The chain stops at a MemoryPhi, where you currently break (give up). The way this is written right now, this will always break, and it can be improved with walking the backedge (walkers do this).
The def_chain starts off by getting the defining access for the given Load (Use), which may skip Stores (Defs) it does not alias in the same iteration. But that load may alias the stores in another iteration, so walking *all* defs above current use seems appropriate.
I don't know how nested loops are handled here.