Try to fix https://bugs.llvm.org/show_bug.cgi?id=34883
The problem is that instruction selection does not work for vector load element if the load is chained to a following prefetch.
Differential D38886
Try to avoid prefetches from disrupting folding of loads. jonpa on Oct 13 2017, 8:30 AM. Authored by
Details Try to fix https://bugs.llvm.org/show_bug.cgi?id=34883 The problem is that instruction selection does not work for vector load element if the load is chained to a following prefetch.
Diff Detail Event TimelineComment Actions It's unintuitive why you need to fix this at the IR level. Both the load and the prefetch should be uses of address and there should be no dependence between them. Let me comment in the bug. Comment Actions
While I found the reason (see Bugzilla) for the dependency, I have removed the changes in LoopDataPrefetch.pp as it is not needed anymore with the backend patch.
I tried first to see why the pattern matching failed. It seems that as soon as such a load has a chain, the matcher will give up. In particular, the test case had a DAG like t126: v4i32 = insert_vector_elt t125, t86, Constant:i32<1> t88: i32,ch = load<LD4[%scevgep32](tbaa=<0x5dd1ee8>)> t131, t80, undef:i64 t128: v4i32 = insert_vector_elt t126, t88, Constant:i32<2> t130: ch = SystemZISD::PREFETCH<LD1[%scevgep20]> t88:1, Constant:i32<1>, t4 t129: v4i32,ch = VLEF<Mem:LD4[%scevgep31](tbaa=<0x5dd1ee8>)> t128, t82 t88 has two users: (t88 load) | | | (data) | (t128 insert_vector_elt) | (ch) (t130 PREFETCH) During pattern matching of t128, I found that during pattern matching CR_InducesCycle was return here: SelectionDAGISel::SelectCodeCommon() of t128: case OPC_EmitMergeInputChains1_1: t88 is inserted into ChainNodesMatched HandleMergeInputChains() WalkChainUsers() // ChainedNode == t88 User == t130 // If the node isn't a token factor and isn't part of our pattern, then it // must be a random chained node in between two nodes we're selecting. // This happens when we have something like: // x = load ptr // call // y = x+4 // store y -> ptr // Because we structurally match the load/store as a read/modify/write, // but the call is chained between them. We cannot fold in this case // because it would induce a cycle in the graph. if (!std::count(ChainedNodesInPattern.begin(), // {t88} ChainedNodesInPattern.end(), User)) // t130 return CR_InducesCycle; This caught my attention since the assumption in the comment is not true. I wondered if I could fix this somehow, but came to the conclusion that even though it is intuitive in this test case to think that merging t128 and t88 is ok, it would for one thing require some isReachable-analysis to prove that t128 is not chained before t130. I was hoping to find some simple heuristic for a INSERT_VECTOR_ELT / PREFETCH context, but so far I am not sure if it's possible. It is interesting that
I then thought about implementing a custom isel for SystemZ / INSERT_VECTOR_ELT, which seemed to be possible but tedious. It would still have to be proved that the INSERT_VECTOR_ELT was not dependent somehow on the prefetch node, which seems more doable here in a limited context. Looking for a simple solution, I instead decided to just try to move the prefetch chains to before the load (of vector element) so that they would not disrupt pattern matching. This is what the current patch does. Some numbers for SystemZ / SPEC-2006: This is the debug output of the LoopDataPrefetch pass when it is considering prefetching (only a few of those end up prefetching): In fact, there is no loop on SPEC that gets prefetches with less than 10 iterations ahead. This is because SystemZ has getPrefetchDistance() { return 2000; }, to prefetch far ahead. Instruction count differences on SPEC: master patched l : 65679 65502 -177 vlef : 807 933 +126 * vlvgf : 910 784 -126 * vlvgp : 2160 2095 -65 * vlrepf : 454 517 +63 * vrepf : 2269 2206 -63 * lg : 319026 318982 -44 lgr : 341609 341634 +25 stg : 124367 124351 -16 lb : 5894 5879 -15 ly : 2213 2198 -15 vlvgb : 14 0 -14 * vleb : 253 267 +14 * lay : 18939 18952 +13 j : 98779 98776 -3 st : 51225 51222 -3 ltg : 40762 40760 -2 vlvgh : 68 66 -2 je : 94500 94498 -2 ... Spill|Reload : 165492 165427 -65 As can be seen, the VLE selection is remedied, and I have also confirmed by disabling the LoopDataPrefetch pass that with patch the number of VLEs are not affected at all anymore. Comment Actions Thinking about it, the current patch fails to update the chains properly: it's ok for the prefetch to be chained higher up in the DAG, but the user of that prefetch should perhaps *not* follow that prefetch up the DAG. Comment Actions After trying hard to fix this issue both in SystemZTargetLowering::lowerPREFETCH(), and by means of custom instruction selection, I figured that whatever I tried, it just became messy and difficult. I therefore decided to try to fix this during DAG construction instead, which was much simpler. SelectionDAGBuilder gets a new member in form of a TargetTransformInfo pointer. This is not strictly not strictly needed, but used to guard this change to work only for targets who return true in supportsEfficientVectorElementLoadStore(). The Intrinsics::prefetch handling in visitIntrinsicCall() is modified so that the chaining of the prefetch node is not done as rigidly as before in the case of a vector element load. This is enough to make the pattern matching work, and instruction selection is equally remedied as per previous version of patch. The reason that the prefetch is put into pending loads and then becoming the root is that the prefetches disappeared during DAGCombiner as they had no users otherwise. Not sure if there is a better way, but this seems simple, legal and works. Test case added. Notes:
Comment Actions ping! Does it look ok to let the DAGBuilder handle this instead (the clearly simpler solution)? Comment Actions So, with this change, what happens? Do the prefetches just all get chained together along with the last store (or other memory operation)?
I think you should remove this. Having this kind of special behavior dependent on seemingly-unrelated features is not desirable. Comment Actions The inital SelectionDAG of the added test case looks like (in part): ... t53: ch = TokenFactor t37:1, t49:1, t50:1, t51:1, t52 t48: i64 = add t2, Constant:i64<2220> t54: i32,ch = load<LD4[%scevgep37]> t53, t48, undef:i64 t66: i32,ch = load<LD4[%20]> t53, GlobalAddress:i64<[150 x %type0]* @Mem> + 408, undef:i64 t14: i64 = add t2, Constant:i64<237540> t67: ch = Prefetch<LD1[%scevgep26]> t53, t14, Constant:i32<0>, Constant:i32<3>, Constant:i32<1> t68: ch = TokenFactor t54:1, t66:1, t67 t61: i64 = add t2, Constant:i64<3108> t69: i32,ch = load<LD4[%scevgep36]> t68, t61, undef:i64 t12: i64 = add t2, Constant:i64<237984> t70: ch = Prefetch<LD1[%scevgep25]> t68, t12, Constant:i32<0>, Constant:i32<3>, Constant:i32<1> t71: ch = TokenFactor t69:1, t70 t63: i64 = add t2, Constant:i64<3552> t72: i32,ch = load<LD4[%scevgep35]> t71, t63, undef:i64 t10: i64 = add t2, Constant:i64<238428> t73: ch = Prefetch<LD1[%scevgep24]> t71, t10, Constant:i32<0>, Constant:i32<3>, Constant:i32<1> t74: ch = TokenFactor t72:1, t73 ... The result seems to be that each (load) prefetch gets chained in parallel with one load, which then in turn are the inputs to a TokenFactor (new DAG Root).
I certainly wouldn't mind removing the check, but I am curious about in what context you find this "seemingly-unrelated"? After all, this patch checks specifically for a load with a single user which is an InsertElement, and the sole purpose is then to help isel with combining those two instructions into a single "Vector Load Element" instruction. This should only be relevant if the target actually supports this, which is what supportsEfficientVectorElementLoadStore() signals. Are you saying that TTI does not belong in the DAGBuilder generally, or that this change is so small that the check isn't worthwhile? Comment Actions Okay, so the prefetch looks like any other load (i.e., it gets changed with the pending loads so that they can all happen in any order, but constrained by other observable side effects).
Understood, but I don't want that check either. I want to have a single, consistent way to lower the prefetch intrinsics into SDAG nodes. If we're going to treat them like loads, and thus re-orderable w.r.t. other loads (but not other writes, etc.), then we should do that in all cases. Do you see potential downsides to doing this? Comment Actions
This makes sense, and I don't see any problems with this. After all, the prefetch gets chained after the previous one in the sequence of vector element loads, so this should still do well also for other targets which prefetch with a smaller distance than SystemZ. Patch updated with the TTI check removed. Comment Actions To clarify: The reason not all load prefetches are handled in this new way (as a pending load), is due to your initial explanation that in a very large loop body, prefetching may be done just one or two iterations ahead. So for any general prefetch load this may theoretically mean a considerable rescheduling, but for a vector element load it should typically just mean that it will be scheduled before the immediately preceding vector element load. This is irrelevant for SystemZ, because prefetching is done >10 iterations ahead anyway, but I don't know for any other target... I would be happy to let you decide either way if the vector element loads should be specially treated or not. Comment Actions ping. Hal, I am just waiting for clarification from your side: If I understand you correctly, you don't want the TTI check (which I removed already), and you would actually prefer to handle all load prefetches the same way (meaning I should remove the checks to detect the vector-element-load case)? Comment Actions
Are you looking for SDNode's hasPredecessor (or hasPredecessorHelper if you can do more than one at a time)? Such dependency checks are not uncommon.
SPEC is not the world :-)
Comment Actions Patch updated per review. See in-line comments.
Comment Actions
Sorry, forgot to answer this when I updated the patch earlier today. That's good to know, but I still think that it is so much simpler to do this in a reasonable way to begin with. Especially since this should be a cross-platform win during isel at least. Besides, that is an expensive check. Comment Actions Yep. That was just an FYI comment.
Comment Actions
Happy to remove it... NFC on SystemZ. I guess I was just trying to handle the case I was seeing while changing as little as possible for other targets. |
I don't see why "prefetch for write" is different from "prefetch for read" in this regard. They're both just prefetches. I'd rather treat all prefetches the same way unless we have a particular reason to do otherwise.