Download Raw Diff

Details

Reviewers

uweigand
anemet
hfinkel

Summary

Try to fix https://bugs.llvm.org/show_bug.cgi?id=34883

The problem is that instruction selection does not work for vector load element if the load is chained to a following prefetch.

Diff Detail

Event Timeline

jonpa created this revision.Oct 13 2017, 8:30 AM

Herald added a subscriber: kosarev. · View Herald TranscriptOct 13 2017, 8:30 AM

It's unintuitive why you need to fix this at the IR level. Both the load and the prefetch should be uses of address and there should be no dependence between them.

Let me comment in the bug.

It's unintuitive why you need to fix this at the IR level. Both the load and the prefetch should be uses of address and there should be no dependence between them.

While I found the reason (see Bugzilla) for the dependency, I have removed the changes in LoopDataPrefetch.pp as it is not needed anymore with the backend patch.

I believe that they're chained to prevent them from migrating to the bottom of a loop. For long loop bodies, they might only be prefetching ~1 iteration ahead, and it is important that they stay near the top if the relevant loads are near the top.

Your matching logic should probably look through the prefetches.

I tried first to see why the pattern matching failed. It seems that as soon as such a load has a chain, the matcher will give up. In particular, the test case had a DAG like

t126: v4i32 = insert_vector_elt t125, t86, Constant:i32<1>

 t88: i32,ch = load<LD4[%scevgep32](tbaa=<0x5dd1ee8>)> t131, t80, undef:i64
   t128: v4i32 = insert_vector_elt t126, t88, Constant:i32<2>
 t130: ch = SystemZISD::PREFETCH<LD1[%scevgep20]> t88:1, Constant:i32<1>, t4

   t129: v4i32,ch = VLEF<Mem:LD4[%scevgep31](tbaa=<0x5dd1ee8>)> t128, t82

t88 has two users:

(t88 load)
|   |
|   (data)
|   (t128 insert_vector_elt)
|
(ch)
(t130 PREFETCH)

During pattern matching of t128, I found that during pattern matching CR_InducesCycle was return here:

SelectionDAGISel::SelectCodeCommon() of t128:
 case OPC_EmitMergeInputChains1_1:

  t88 is inserted into ChainNodesMatched

  HandleMergeInputChains()
    WalkChainUsers() // ChainedNode == t88

      User == t130

      // If the node isn't a token factor and isn't part of our pattern, then it
      // must be a random chained node in between two nodes we're selecting.
      // This happens when we have something like:
      //   x = load ptr
      //   call
      //   y = x+4
      //   store y -> ptr
      // Because we structurally match the load/store as a read/modify/write,
      // but the call is chained between them.  We cannot fold in this case
      // because it would induce a cycle in the graph.
      if (!std::count(ChainedNodesInPattern.begin(),      // {t88}
                      ChainedNodesInPattern.end(), User)) // t130
        return CR_InducesCycle;

This caught my attention since the assumption in the comment is not true. I wondered if I could fix this somehow, but came to the conclusion that even though it is intuitive in this test case to think that merging t128 and t88 is ok, it would for one thing require some isReachable-analysis to prove that t128 is not chained before t130. I was hoping to find some simple heuristic for a INSERT_VECTOR_ELT / PREFETCH context, but so far I am not sure if it's possible. It is interesting that

The PREFETCH is *only* chained, while the INSERT_VECTOR_ELT is unchained.
The ChainedNodesInPattern always just contains one node with OPC_EmitMergeInputChains1_2 (in this case t88, with User t130). The conclusion then is that any chained load will return CR_InducecCycle with OPC_EmitMergeInputChains1_1.

I then thought about implementing a custom isel for SystemZ / INSERT_VECTOR_ELT, which seemed to be possible but tedious. It would still have to be proved that the INSERT_VECTOR_ELT was not dependent somehow on the prefetch node, which seems more doable here in a limited context.

Looking for a simple solution, I instead decided to just try to move the prefetch chains to before the load (of vector element) so that they would not disrupt pattern matching. This is what the current patch does.

Some numbers for SystemZ / SPEC-2006:

This is the debug output of the LoopDataPrefetch pass when it is considering prefetching (only a few of those end up prefetching):
Number of loops prefetching: 44607
Number of loops prefetching <10 iterations ahead: 210 (< 0.5%)
Number of loops prefetching <5 iterations ahead: 43 (< 0.1%)

In fact, there is no loop on SPEC that gets prefetches with less than 10 iterations ahead. This is because SystemZ has getPrefetchDistance() { return 2000; }, to prefetch far ahead.

Instruction count differences on SPEC:

                              master               patched
l              :                65679                65502     -177
vlef           :                  807                  933     +126  *
vlvgf          :                  910                  784     -126  *
vlvgp          :                 2160                 2095      -65  *
vlrepf         :                  454                  517      +63  *
vrepf          :                 2269                 2206      -63  *
lg             :               319026               318982      -44
lgr            :               341609               341634      +25
stg            :               124367               124351      -16
lb             :                 5894                 5879      -15
ly             :                 2213                 2198      -15
vlvgb          :                   14                    0      -14  *
vleb           :                  253                  267      +14  *
lay            :                18939                18952      +13
j              :                98779                98776       -3
st             :                51225                51222       -3
ltg            :                40762                40760       -2
vlvgh          :                   68                   66       -2
je             :                94500                94498       -2
...
Spill|Reload   :               165492               165427      -65

As can be seen, the VLE selection is remedied, and I have also confirmed by disabling the LoopDataPrefetch pass that with patch the number of VLEs are not affected at all anymore.

Thinking about it, the current patch fails to update the chains properly: it's ok for the prefetch to be chained higher up in the DAG, but the user of that prefetch should perhaps *not* follow that prefetch up the DAG.

After trying hard to fix this issue both in SystemZTargetLowering::lowerPREFETCH(), and by means of custom instruction selection, I figured that whatever I tried, it just became messy and difficult. I therefore decided to try to fix this during DAG construction instead, which was much simpler.

SelectionDAGBuilder gets a new member in form of a TargetTransformInfo pointer. This is not strictly not strictly needed, but used to guard this change to work only for targets who return true in supportsEfficientVectorElementLoadStore().

The Intrinsics::prefetch handling in visitIntrinsicCall() is modified so that the chaining of the prefetch node is not done as rigidly as before in the case of a vector element load. This is enough to make the pattern matching work, and instruction selection is equally remedied as per previous version of patch. The reason that the prefetch is put into pending loads and then becoming the root is that the prefetches disappeared during DAGCombiner as they had no users otherwise. Not sure if there is a better way, but this seems simple, legal and works.

Test case added.

Notes:

it seems possible to do this in lowerPREFETCH(), but it is more work to traverse the chain up through all VLEs and modify the chains. One function is needed to detect if isLoadIntoVectorElement(). The case of a TokenFactor must also be handled which makes it less simple of a fix. Then care has to be taken to change the chain users of the prefetch to the prefetches chain.
Handling this as an isel-problem means doing a custom selection of both VLRE and VLREP as they are two cases that emerge from this (first element of vector uses a REPLICATE node and the remaining INSERT_VECTOR_ELT). Deciding the opcodes and creating the new nodes became quite lengthy. Not to mention the fact that there would have to be some additional traversing of the DAG to check that no cycle is introduced.

ping!

Does it look ok to let the DAGBuilder handle this instead (the clearly simpler solution)?

So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?

SelectionDAGBuilder gets a new member in form of a TargetTransformInfo pointer. This is not strictly not strictly needed, but used to guard this change to work only for targets who return true in supportsEfficientVectorElementLoadStore().

I think you should remove this. Having this kind of special behavior dependent on seemingly-unrelated features is not desirable.

In D38886#914809, @hfinkel wrote:

So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?

The inital SelectionDAG of the added test case looks like (in part):

  ...
t53: ch = TokenFactor t37:1, t49:1, t50:1, t51:1, t52
    t48: i64 = add t2, Constant:i64<2220>
  t54: i32,ch = load<LD4[%scevgep37]> t53, t48, undef:i64
  t66: i32,ch = load<LD4[%20]> t53, GlobalAddress:i64<[150 x %type0]* @Mem> + 408, undef:i64
      t14: i64 = add t2, Constant:i64<237540>
    t67: ch = Prefetch<LD1[%scevgep26]> t53, t14, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t68: ch = TokenFactor t54:1, t66:1, t67
    t61: i64 = add t2, Constant:i64<3108>
  t69: i32,ch = load<LD4[%scevgep36]> t68, t61, undef:i64
      t12: i64 = add t2, Constant:i64<237984>
    t70: ch = Prefetch<LD1[%scevgep25]> t68, t12, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t71: ch = TokenFactor t69:1, t70
    t63: i64 = add t2, Constant:i64<3552>
  t72: i32,ch = load<LD4[%scevgep35]> t71, t63, undef:i64
      t10: i64 = add t2, Constant:i64<238428>
    t73: ch = Prefetch<LD1[%scevgep24]> t71, t10, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t74: ch = TokenFactor t72:1, t73
...

The result seems to be that each (load) prefetch gets chained in parallel with one load, which then in turn are the inputs to a TokenFactor (new DAG Root).

SelectionDAGBuilder gets a new member in form of a TargetTransformInfo pointer. This is not strictly not strictly needed, but used to guard this change to work only for targets who return true in supportsEfficientVectorElementLoadStore().

I think you should remove this. Having this kind of special behavior dependent on seemingly-unrelated features is not desirable.

I certainly wouldn't mind removing the check, but I am curious about in what context you find this "seemingly-unrelated"? After all, this patch checks specifically for a load with a single user which is an InsertElement, and the sole purpose is then to help isel with combining those two instructions into a single "Vector Load Element" instruction. This should only be relevant if the target actually supports this, which is what supportsEfficientVectorElementLoadStore() signals. Are you saying that TTI does not belong in the DAGBuilder generally, or that this change is so small that the check isn't worthwhile?

In D38886#915225, @jonpa wrote:

In D38886#914809, @hfinkel wrote:

So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?

The inital SelectionDAG of the added test case looks like (in part):

  ...
t53: ch = TokenFactor t37:1, t49:1, t50:1, t51:1, t52
    t48: i64 = add t2, Constant:i64<2220>
  t54: i32,ch = load<LD4[%scevgep37]> t53, t48, undef:i64
  t66: i32,ch = load<LD4[%20]> t53, GlobalAddress:i64<[150 x %type0]* @Mem> + 408, undef:i64
      t14: i64 = add t2, Constant:i64<237540>
    t67: ch = Prefetch<LD1[%scevgep26]> t53, t14, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t68: ch = TokenFactor t54:1, t66:1, t67
    t61: i64 = add t2, Constant:i64<3108>
  t69: i32,ch = load<LD4[%scevgep36]> t68, t61, undef:i64
      t12: i64 = add t2, Constant:i64<237984>
    t70: ch = Prefetch<LD1[%scevgep25]> t68, t12, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t71: ch = TokenFactor t69:1, t70
    t63: i64 = add t2, Constant:i64<3552>
  t72: i32,ch = load<LD4[%scevgep35]> t71, t63, undef:i64
      t10: i64 = add t2, Constant:i64<238428>
    t73: ch = Prefetch<LD1[%scevgep24]> t71, t10, Constant:i32<0>, Constant:i32<3>, Constant:i32<1>
  t74: ch = TokenFactor t72:1, t73
...

The result seems to be that each (load) prefetch gets chained in parallel with one load, which then in turn are the inputs to a TokenFactor (new DAG Root).

Okay, so the prefetch looks like any other load (i.e., it gets changed with the pending loads so that they can all happen in any order, but constrained by other observable side effects).

SelectionDAGBuilder gets a new member in form of a TargetTransformInfo pointer. This is not strictly not strictly needed, but used to guard this change to work only for targets who return true in supportsEfficientVectorElementLoadStore().

I think you should remove this. Having this kind of special behavior dependent on seemingly-unrelated features is not desirable.

I certainly wouldn't mind removing the check, but I am curious about in what context you find this "seemingly-unrelated"? After all, this patch checks specifically for a load with a single user which is an InsertElement, and the sole purpose is then to help isel with combining those two instructions into a single "Vector Load Element" instruction. This should only be relevant if the target actually supports this, which is what supportsEfficientVectorElementLoadStore() signals. Are you saying that TTI does not belong in the DAGBuilder generally, or that this change is so small that the check isn't worthwhile?

Understood, but I don't want that check either. I want to have a single, consistent way to lower the prefetch intrinsics into SDAG nodes. If we're going to treat them like loads, and thus re-orderable w.r.t. other loads (but not other writes, etc.), then we should do that in all cases. Do you see potential downsides to doing this?

Understood, but I don't want that check either. I want to have a single, consistent way to lower the prefetch intrinsics into SDAG nodes. If we're going to treat them like loads, and thus re-orderable w.r.t. other loads (but not other writes, etc.), then we should do that in all cases. Do you see potential downsides to doing this?

This makes sense, and I don't see any problems with this. After all, the prefetch gets chained after the previous one in the sequence of vector element loads, so this should still do well also for other targets which prefetch with a smaller distance than SystemZ.

Patch updated with the TTI check removed.

To clarify:

The reason not all load prefetches are handled in this new way (as a pending load), is due to your initial explanation that in a very large loop body, prefetching may be done just one or two iterations ahead. So for any general prefetch load this may theoretically mean a considerable rescheduling, but for a vector element load it should typically just mean that it will be scheduled before the immediately preceding vector element load.

This is irrelevant for SystemZ, because prefetching is done >10 iterations ahead anyway, but I don't know for any other target... I would be happy to let you decide either way if the vector element loads should be specially treated or not.

ping.

Hal, I am just waiting for clarification from your side: If I understand you correctly, you don't want the TTI check (which I removed already), and you would actually prefer to handle all load prefetches the same way (meaning I should remove the checks to detect the vector-element-load case)?

ping

Ping!

I then thought about implementing a custom isel for SystemZ / INSERT_VECTOR_ELT, which seemed to be possible but tedious. It would still have to be proved that the INSERT_VECTOR_ELT was not dependent somehow on the prefetch node, which seems more doable here in a limited context.

Are you looking for SDNode's hasPredecessor (or hasPredecessorHelper if you can do more than one at a time)? Such dependency checks are not uncommon.

In fact, there is no loop on SPEC that gets prefetches with less than 10 iterations ahead. This is because SystemZ has getPrefetchDistance() { return 2000; }, to prefetch far ahead.

SPEC is not the world :-)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5816 ↗	(On Diff #121707)	I don't see why mentioning SystemZ here makes sense. Won't the affect any target that uses and code using SelectionDAG::areNonVolatileConsecutiveLoads, such as DAGCombiner::CombineConsecutiveLoads?
5824 ↗	(On Diff #121707)	This check is very fragile, please remove it. "which is good enough since a prefetch is always placed before its load." - No. Our current auto-prefetching pass happens to do that, but that's not the only way prefetches can appear in the code, and that behavior could change at any time. We shouldn't embed here something which depends on that behavior. What appears here should be the right thing to do in general.

Patch updated per review. See in-line comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5816 ↗	(On Diff #121707)	OK - updated the comment.
5824 ↗	(On Diff #121707)	I guess I was trying a bit too hard to not change behavior in other cases that what I was handling... I removed the check, which is NFC on SystemZ, and it should be generally good for other targets also, for example in cases where the load can be folded into an operand. The tests are still passing.

Are you looking for SDNode's hasPredecessor (or hasPredecessorHelper if you can do more than one at a time)? Such dependency checks are not uncommon.

Sorry, forgot to answer this when I updated the patch earlier today. That's good to know, but I still think that it is so much simpler to do this in a reasonable way to begin with. Especially since this should be a cross-platform win during isel at least. Besides, that is an expensive check.

In D38886#948428, @jonpa wrote:

Are you looking for SDNode's hasPredecessor (or hasPredecessorHelper if you can do more than one at a time)? Such dependency checks are not uncommon.

Sorry, forgot to answer this when I updated the patch earlier today. That's good to know, but I still think that it is so much simpler to do this in a reasonable way to begin with. Especially since this should be a cross-platform win during isel at least. Besides, that is an expensive check.

Yep. That was just an FYI comment.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5826 ↗	(On Diff #125933)	I don't see why "prefetch for write" is different from "prefetch for read" in this regard. They're both just prefetches. I'd rather treat all prefetches the same way unless we have a particular reason to do otherwise.

I don't see why "prefetch for write" is different from "prefetch for read" in this regard. They're both just prefetches. I'd rather treat all prefetches the same way unless we have a particular reason to do otherwise.

Happy to remove it... NFC on SystemZ. I guess I was just trying to handle the case I was seeing while changing as little as possible for other targets.

Patch rebased.

Let's try it (LGTM).

This revision is now accepted and ready to land.Dec 15 2017, 5:43 AM

r322163.

This is an archive of the discontinued LLVM Phabricator instance.

Try to avoid prefetches from disrupting folding of loads.
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 118913

lib/Transforms/Scalar/LoopDataPrefetch.cpp

This is an archive of the discontinued LLVM Phabricator instance.

Try to avoid prefetches from disrupting folding of loads.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 118913

lib/Transforms/Scalar/LoopDataPrefetch.cpp

Try to avoid prefetches from disrupting folding of loads.
ClosedPublic