This is an archive of the discontinued LLVM Phabricator instance.

[X86][CodeGenPrepare] Try to reuse IV's incremented value instead of adding the offset
ClosedPublic

Authored by mkazantsev on Feb 10 2021, 12:30 AM.

Details

Summary

While optimizing the memory instruction, we sometimes need to add
offset to the value of IV. We could avoid doing so if the IV.next is
already defined at the point of interest. In this case, we may get two
possible advantages from this:

  • If the IV step happens to match with the offset, we don't need to add the offset at all;
  • We reduce overlap of live ranges of IV and IV.next. They may stop overlapping and it will lead to better register allocation. Even if the overlap will preserve, we are not introducing a new overlap, so it should be a neutral transform.

Currently I've only added support for IVs that get decremented using usub
intrinsic. We could also support AddInstr, however there is some weird
interaction with some other transform that may lead to infinite compilation
in this case (seems like same transform is done and undone over and over).
I need to investigate why it happens, but generally we could do that too.

Diff Detail

Event Timeline

mkazantsev created this revision.Feb 10 2021, 12:30 AM
mkazantsev requested review of this revision.Feb 10 2021, 12:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2021, 12:30 AM
mkazantsev planned changes to this revision.Feb 10 2021, 10:40 PM
mkazantsev requested review of this revision.EditedFeb 11 2021, 3:53 AM

Found one reason of TODO, it may only happen with AddInstr, will add in follow-up patch. Seems that there is one more hang with AddInstr. So I suggest to have usub first and then follow-up with add support.

aqjune added a comment.EditedFeb 17 2021, 5:29 AM

The change from the previous output to the new output looks valid to me.

$ ./build/alive-tv D96399.srctgt.ll -disable-undef-input

----------------------------------------
define * @src(* %p, i64 %idx) {
%0:
  %addr1 = mul i64 %idx, 4
  %ptr = gep * %p, 1 x i64 %addr1
  %ptr2 = gep * %ptr, 1 x i64 -4
  ret * %ptr2
}
=>
define * @tgt(* %p, i64 %idx) {
%0:
  %res = usub_overflow {i64, i1, i24} %idx, 1
  %sub = extractvalue {i64, i1, i24} %res, 0
  %addr2 = mul i64 %sub, 4
  %ptr2 = gep * %p, 1 x i64 %addr2
  ret * %ptr2
}
Transformation seems to be correct!

(The disable-undef-input flag added to resolve timeout, but it is still okay with undef inputs I think)

reames requested changes to this revision.Feb 22 2021, 11:35 AM

Max,

A couple of high level comments here.

  1. You're code is specific to generating the address as GEPs. I think we want to handle this for the non-GEP path as well.
  2. I think this makes more sense to phrase as an optimization step for the AddrMode we're about to sink. Doing it that way should also address (1) above.
  3. I'm definitely not okay with the initial patch being so narrowly focused. If you were handling generic add and sub, but not the overflow intrinsics I'd be less concerned, but the very narrow focus on one of the overflow along with your todo makes me strongly suspect there is a lurking correctness issue which the usub tests simply happen not to expose.
  4. I think I might see the correctness issue, or at least a hint of it. Consider the case where addressing does overflow. The wrapping semantics of a GEP are not the same as the usubo. That difference means that if overflow occurs, your optimized AddrMode is incorrect. I believe you need to restrict this transform to when you can prove overflow causes the memory inst not to be reached.
This revision now requires changes to proceed.Feb 22 2021, 11:35 AM

You're code is specific to generating the address as GEPs. I think we want to handle this for the non-GEP path as well.

I think this makes more sense to phrase as an optimization step for the AddrMode we're about to sink. Doing it that way should also address (1) above.

Agreed. I tried doing this during AddrMode computation, but it's impossible because it's lacking knowledge about the user. But I think it can be hoisted out of condition.

I think I might see the correctness issue, or at least a hint of it. Consider the case where addressing does overflow. The wrapping semantics of a GEP are not the same as the usubo. That difference means that if overflow occurs, your optimized AddrMode is incorrect. I believe you need to restrict this transform to when you can prove overflow causes the memory inst not to be reached.

I don't quite get the point here. If overflow occurs in optimized case, it also occurs in non-optimized case. My transform does not change the actual offset, it just simplifies the way how it's computed.

mkazantsev planned changes to this revision.Feb 24 2021, 1:08 AM
mkazantsev retitled this revision from [X86][CodeGenPrepare] Try to reuse IV's incremented value instead of adding the offset to [X86][CodeGenPrepare] Try to reuse IV's incremented value instead of adding the offset (WIP).

Moved the logic into AddrMode optimization code, so that GEP and non-GEP modes can use it separately. So it should resolve concerns 1 and 2.

As for potential functional bugs and TODO, need to think more. Placing (WIP) until resolved.

mkazantsev planned changes to this revision.Feb 24 2021, 4:05 AM

Generalized to add/sub/uadd with overflow cases.

mkazantsev planned changes to this revision.Feb 25 2021, 10:03 PM

Need to address correctness issue with overflow.

mkazantsev retitled this revision from [X86][CodeGenPrepare] Try to reuse IV's incremented value instead of adding the offset (WIP) to [X86][CodeGenPrepare] Try to reuse IV's incremented value instead of adding the offset.

Added tests & safety checks against add/sub with nuw/nsw flags to avoid potential misuse of poisoned values. So far we conservatively restrain from optimizing such doubteous cases. Follow-up planned: try to prove flags is possible, if not - drop them (?).

reames requested changes to this revision.Feb 26 2021, 10:02 AM
reames added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
1279

Can you precommit the extraction of the lambda to a static function? The diff is confusing to read.

1318–1319

Same with the rename here. (Or at least, this appears to be only a rename?)

3838

Please remove whitespace change, feel free to commit separately.

3868

You really should be able to factor out some common code here with the isIVIncrement function above. Maybe time for a getIVIncrement function?

Hm, though there appears to be a bug in this copy. The "incr" you identify doesn't appear to be tied to the PN.

5107–5108

The need for domtree here introduces a potential compile time problem given the way CGP manages domtree invalidation. Just noting it for now.

This revision now requires changes to proceed.Feb 26 2021, 10:02 AM

I suggest taking a look at the matchSimpleRecurrence routine I just added to ValueTracking. I strongly suspect you can simplify some of this code a lot.

mkazantsev added inline comments.Feb 28 2021, 8:08 PM
llvm/lib/CodeGen/CodeGenPrepare.cpp
3868

IVInc is the incoming value of the Phi node from the backedge which has the same phi node as argument thru matcher. That makes it tied, no?

mkazantsev marked 2 inline comments as done.Feb 28 2021, 9:15 PM
mkazantsev marked an inline comment as done.Feb 28 2021, 11:29 PM
mkazantsev added inline comments.
llvm/lib/CodeGen/CodeGenPrepare.cpp
5107–5108

Let's see if it will have visible measurable impact, and if so, think how to tackle it.

Addressed comments: all independent changes commited separately.

mkazantsev planned changes to this revision.Feb 28 2021, 11:42 PM
reames accepted this revision.Mar 3 2021, 3:56 PM

LGTM w/required comments.

llvm/lib/CodeGen/CodeGenPrepare.cpp
3890

I'm not really convinced of your second point above (e.g. the generic offset one). In particular, it's not clear to me that an arbitrary offset is *always* better than an overlapped live interval.

I'm also not convinced you're wrong.

I'm not going to require you stage this, but if you need to revert for any reason I'll strongly suggest staging first the case where the existing base offset cancels, and then doing the general offset case in a second patch.

3897

As written, there's a potential overflow here for SINT_MAX and Scale = 8. Please use APInt.smul_ov here.

3900

If I'm reading the code correctly, you want this to be adding IVInc to the instruction list, not the phi node.

This revision is now accepted and ready to land.Mar 3 2021, 3:56 PM
This revision was landed with ongoing or failed builds.Mar 4 2021, 12:23 AM
This revision was automatically updated to reflect the committed changes.
nikic added a comment.Mar 4 2021, 12:57 AM

From a cursory look, the problem is likely that you're doing an unconditional call to getDT(), which will lead to unnecessary DomTree calculations. The way this function is supposed to be used is directly when checking dominance, after other checks have already been done.

reames added a comment.Mar 4 2021, 9:34 AM

I went ahead and pushed a change (e0cfd451) to lazily compute domtree as per @nikic's guess. @xbolva00 do you still see the compile time regression with this change?

nikic added a comment.Mar 4 2021, 9:56 AM

I went ahead and pushed a change (e0cfd451) to lazily compute domtree as per @nikic's guess. @xbolva00 do you still see the compile time regression with this change?

Thanks! Here are the results for the range with both commits (it does not look like any other relevant changes landed in the meantime): http://llvm-compile-time-tracker.com/compare.php?from=b15ce2f344ac7845729d2be0a8316b20a32c6292&to=e0cfd451718e2524cc5b8f98ecd72a75d37146cc&stat=instructions

For most programs the regression is resolved, only mafft still has a noteworthy regression, as well as the ReleaseLTO-g configurations (don't know if related to the LTO part or the -g part though).

For most programs the regression is resolved, only mafft still has a noteworthy regression, as well as the ReleaseLTO-g configurations (don't know if related to the LTO part or the -g part though).

How concerned by this are you? To me this looks relatively isolated, and likely directly related to code changes in the respective benchmarks. (Though, is that an easy way to check that without building it all myself?) I can see one more small tweak I can make to defer domtree usage a bit further (will do so), but at some point it's pretty fundamental to this change (which I do think is worthwhile). Thoughts?

Pushed 6af94d22 which defers the domtree compute slightly further.

As an aside, the analysis invalidation in this code is a mess. We appear to be leaving loopinfo (which is derived from DT) stale. It just happens that the particular variety of stale doesn't expose problems in practice (joy). It's also not clear why we don't just update DT in the various transforms which modify it. None of them appear particularly hard to do.

nikic added a comment.Mar 4 2021, 11:47 AM

For most programs the regression is resolved, only mafft still has a noteworthy regression, as well as the ReleaseLTO-g configurations (don't know if related to the LTO part or the -g part though).

How concerned by this are you? To me this looks relatively isolated, and likely directly related to code changes in the respective benchmarks. (Though, is that an easy way to check that without building it all myself?) I can see one more small tweak I can make to defer domtree usage a bit further (will do so), but at some point it's pretty fundamental to this change (which I do think is worthwhile). Thoughts?

I'm not very concerned about that particular regression, but rather the potential implications. If this is indeed still caused by domtree calculations (rather than second order effects -- that also sounds plausible), then it is quite likely that there are also pathological cases (this is a typical issue for domtree invalidation + use in the same pass).

As an aside, the analysis invalidation in this code is a mess. We appear to be leaving loopinfo (which is derived from DT) stale. It just happens that the particular variety of stale doesn't expose problems in practice (joy). It's also not clear why we don't just update DT in the various transforms which modify it. None of them appear particularly hard to do.

Yeah, it's a real mess. What I find particularly concerning is that a lot of transforms set a ModifiedDT flag, despite not doing any CFG changes! For example combineToUAddWithOverflow() seems to use this flag to avoid instruction iterator invalidation -- flushing the DT and reprocessing the whole function is a pretty big hammer for that problem.

I think we're at the point where if we do see problematic cases due to dom tree handling, we should just fix CGP to not use the current invalidation scheme. Doing that is not a huge amount of work, and we're spending more time avoiding it than is warranted.

If necessary, I'll even volunteer to do it. I will put that off until we have a motivating test case though. :)