This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Leverage speculation safety to avoid masked.loads
ClosedPublic

Authored by reames on Aug 23 2019, 3:03 PM.

Details

Summary

This is my first real patch to the loop vectorizer. There's some obvious room for API cleanup in the patch, but before I iterate on that, I want to make sure I'm approaching the problem the "right way" and that my attempt at clarifying comments are actually correct. :)

The intention of the patch is to avoid generating masked loads for cases where we can simply speculate the load, and then ignore the results. This is a building block on the path to a longer term goal, which is to eventually support early exits in the vectorizer. (But that's out of scope for the moment.)

Diff Detail

Repository
rL LLVM

Event Timeline

reames created this revision.Aug 23 2019, 3:03 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2019, 3:03 PM
lebedev.ri added inline comments.
test/Transforms/LoopVectorize/load-deref-pred.ll
1 ↗(On Diff #216963)

Precommit?

vivekvpandya added inline comments.Aug 24 2019, 3:39 AM
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
918 ↗(On Diff #216963)

Minor: loan -> load

975 ↗(On Diff #216963)

Minor: loop loop -> loop or you mean loop, loop header

xbolva00 added inline comments.
test/Transforms/LoopVectorize/load-deref-pred.ll
17 ↗(On Diff #216963)

br i1 false

Vectorizer should cleanup it a bit..

Ayal added a comment.Aug 26 2019, 4:53 AM

There's some obvious room for API cleanup in the patch, but before I iterate on that, I want to make sure I'm approaching the problem the "right way" and that my attempt at clarifying comments are actually correct. :)

Yes, this approach of refining SafePointe[r]s, originally designed to express derefereceability, LGTM.

It may be better to place isSafeToLoadUnconditionallyInLoop() in Load.{h,cpp} rather than in LoopVectorizationLegality.cpp; this needs to also work with FoldTail / IsAnnotatedParallel; and not sure about extending the load reasoning to stores etc., but all those can be part of "API cleanup".

Added some minor comments.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
20 ↗(On Diff #216963)

nit: lex ordering (the above admittedly sets an unordered example..)

950 ↗(On Diff #216963)

I'm a bit confused by Loads.cpp's comment, wrt Size:

/// This uses the pointee type to determine how many bytes need to be safe to
/// load from the pointer.
976 ↗(On Diff #216963)

Definitely better explanation, committable irrespective of this patch as NFC.

981 ↗(On Diff #216963)

The original if blockNeedsPredication(BB) condition was an early-continue; now more suitable that its complementary condition be an early-continue instead.
OTOH, can fuse the two "for (Instruction &I : *BB) if (auto *Ptr = ...)" loops together, with an early-continue if (!blockNeedsPredication(BB)) inside - or w/o it - in case isSafeToSpeculativelyExecute() will return true just as fast(?).
(Can check if Ptr is already in SafePointe[r]s to retain an early-continue ;-).

982 ↗(On Diff #216963)

typo: a[n] address

990 ↗(On Diff #216963)

Sure, we're not trying to hoist/LICM the load out of the loop.

993 ↗(On Diff #216963)

Does this "generic version" add anything, given that it returns false for stores, and does it account for the addresses accessed along all iterations of the loop?

1008 ↗(On Diff #216963)

Fold these conditions (of the generic version) into isSafeToLoadUnconditionallyInLoop()?

test/Transforms/LoopVectorize/load-deref-pred.ll
17 ↗(On Diff #216963)

Vectorizer leaves it and others like it for subsequent passes to clean up.

89 ↗(On Diff #216963)

This "for (i=0; i<4096; ++i) {val = 0; if (i < len) val = a[i]; acc += val}" loop is indeed probably a minimal example of a speculative load.
(It does deserve a better optimization though - that of trimming the loop bound ;-)

fhahn added a comment.Aug 26 2019, 9:36 AM

Thanks for the patch, looks good with the outstanding comments addressed. Additionally, it would be great to have a few more negative test cases (ptr operand not safe to speculatively execute, dynamic bounds, access in predicated block outside of loop).

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
997 ↗(On Diff #216963)

We only use the generic version (isSafeToSpeculativelyExecute) for pointer operands, right?

1020 ↗(On Diff #216963)

This is not needed, right?

This is a building block on the path to a longer term goal, which is to eventually support early exits in the vectorizer. (But that's out of scope for the moment.)

Is there anything more you can share on this in terms of design plans?

reames updated this revision to Diff 217233.Aug 26 2019, 1:42 PM

Rebase over landed tests (including a bunch of negative ones).

Note that this reveals a bug in the patch. Will fix, and iterate from there.

Expect several rounds of incremental improvements before ready for real review.

reames planned changes to this revision.Aug 26 2019, 1:42 PM
reames marked 4 inline comments as done.Aug 26 2019, 1:48 PM
xbolva00 added inline comments.Aug 26 2019, 1:58 PM
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1006 ↗(On Diff #217233)

Maybe also SanitizeMemory?

Ayal added a comment.Aug 27 2019, 2:23 AM

Rebase over landed tests (including a bunch of negative ones).

Note that this reveals a bug in the patch. Will fix, and iterate from there.

Expect several rounds of incremental improvements before ready for real review.

Sure. Additional points of thought, possibly as a separate follow-up patches:

  • Interleave groups may also be effected, e.g., two speculative interleaved loads from distinct basic blocks could potentially be combined; useMaskedInterleavedAccesses() deserves attention, as raised by test_non_unit_stride().
  • Perhaps caching SafePointers(InLoop) could be offloaded altogether from LV to Load.cpp.
reames updated this revision to Diff 217531.Aug 27 2019, 4:46 PM

Rebase over landed changes, functional issue identified in last iteration was fixed in rL370102.

reames updated this revision to Diff 217534.Aug 27 2019, 5:02 PM

Starting style cleanup, now passing all vectorize tests.

reames planned changes to this revision.Aug 27 2019, 5:02 PM

Still more work planned.

reames updated this revision to Diff 218190.Aug 30 2019, 4:50 PM

Actually ready for review now. I think I've covered all the cornercases and the structure of the code should be more reasonable.

I left in two todos about moving code. As a matter of practicality, I'd prefer to land as is, then do NFC moves in a follow up. Keeps the rebuild times much more manageable.

p.s. There was one comment in previous review about needing to handle the interleave case. I'm interpreting that as a request for further optimization (i.e. broadened scope) and have left it for future work. If it was actually a correctness concern, then please explain more.

xbolva00 added inline comments.Sep 6 2019, 9:32 AM
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
976 ↗(On Diff #218190)

Also SanitizeMemory?

reames marked an inline comment as done.Sep 6 2019, 3:49 PM
reames added inline comments.
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
976 ↗(On Diff #218190)

If so, as a separate change. This exactly matches the list in ValueTracking.

Ayal added a comment.Sep 8 2019, 11:21 AM

This LGTM, just some minor nits.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
918 ↗(On Diff #218190)

/// Return true if we can [prove] that the given

928 ↗(On Diff #218190)

What about loads from uniform addresses? If desired their unmasking can be delayed to a later patch, worth leaving a TODO.

935 ↗(On Diff #218190)

indent

"LoadSize"? This refers to the memory size of a single element.

936 ↗(On Diff #218190)

don't touch every byte >> have gaps

942 ↗(On Diff #218190)

It would be great to handle symbolic trip counts known to be sufficiently small.

945 ↗(On Diff #218190)

"Size"? This refers to the entire memory interval accessed by all iterations of the loop.

Set Size = LoadSize * TC?

Handling gaps would imply (only?) setting Size accordingly, right? That may be useful for unmasking gathers / interleave-groups.

948 ↗(On Diff #218190)

StartS must be LoopInvariant, iiuc. ("All operands of an AddRec are required to be loop invariant.")

This method is currently employed only by the innermost loop vectorizer; if/when considered by outer loop vectorization, there will be more than a single loop to look after; i.e., StartS may also be an AddRec to check.

957 ↗(On Diff #218190)

For the moment - as in TODO?

the access size - of a single element

Do we (need to) check that "the base is aligned"?

968 ↗(On Diff #218190)

Can first refactor ValueTracking.h/cpp's isSafeToSpeculativelyExecute().

969 ↗(On Diff #218190)

mustSuppressSpecula[ta]tion

1007 ↗(On Diff #218190)

a[n] address

1009 ↗(On Diff #218190)

For the moment - as in TODO?

1020 ↗(On Diff #218190)

If DL is left to be computed by callee rather than caller, can simplify into:

if (LI && !mustSuppressSpeculation(*LI) &&
    isDereferenceableAndAlignedInLoop(LI, TheLoop, SE, *DT))
  SafePointes.insert(LI->getPointerOperand());
20 ↗(On Diff #216963)

ValueTracking.h < VectorUtils.h

This revision was not accepted when it landed; it landed in state Needs Review.Sep 9 2019, 1:56 PM
This revision was automatically updated to reflect the committed changes.
reames marked an inline comment as done.
Ayal added a comment.Sep 11 2019, 12:59 PM

...
p.s. There was one comment in previous review about needing to handle the interleave case. I'm interpreting that as a request for further optimization (i.e. broadened scope) and have left it for future work. If it was actually a correctness concern, then please explain more.

Sorry for late response. This was indeed referring to potential for further optimizations, such as:

  1. Current logic for handling an interleave group (of loads) with gaps at the end, is to ensure there's at-least one scalar iteration following all vector iterations. This may result in running VF*UF scalar iterations instead of a single last vector/unrolled iteration. With the ability to "close this gap" - prove that all missing elements of the last vector iteration can be loaded speculatively, this can be improved.
  2. Current logic requires all members of an interleave group to belong to same basic-block. Unmasking loads should help facilitate forming interleave groups across distinct basic-blocks.
xbolva00 added inline comments.Sep 15 2019, 3:13 AM
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1017

Do you plan to work on stores?

MaskRay added inline comments.
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977

Is this isUnordered or isSimple?

IIUC isSimple is a subset of isUnordered.

reames added inline comments.Sep 15 2020, 11:36 AM
lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
957 ↗(On Diff #218190)

"the base is aligned" is checked within the called function a line below.

llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977

Well, it's written as intended, but that might not be what you're trying to ask? Do you have an example you're wondering about?

1017

I am not going to get back to this any time soon.

MaskRay added inline comments.Sep 15 2020, 11:42 AM
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977

In D87538, VectorCombine::vectorizeLoadInsert uses the utility to suppress some cases. It uses Load->isSimple() but I don't know whether it should be isUnordered or mustSuppressSpeculation should use isSimple. Appreciate if you can take a look at that piece of code (you can ignore the most complexity there. The main thing is that it is a load widening)

reames added inline comments.Sep 15 2020, 11:53 AM
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977

It is safe to speculate an unordered atomic load.

We're conservative in a bunch of places in the optimizer by using isSimple where isUnordered is legal, please don't add new ones if you can avoid it.

reames added inline comments.Sep 15 2020, 11:55 AM
llvm/trunk/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
977

In general, it is not safe to combine arbitrary atomic loads as we may not be able to lower a wider atomic. The code in InstCombine you reference also looks correct, but I don't really see the connection between the two reviews. They're reasoning about different aspects of legality.