This is an archive of the discontinued LLVM Phabricator instance.

[LV] Suppress vectorization in some nontemporal cases
ClosedPublic

Authored by wristow on May 9 2019, 3:21 PM.

Details

Summary

When considering a loop containing nontemporal stores or loads for
vectorization, suppress the vectorization if the corresponding
vectorized store or load with the aligment of the original scaler
memory op is not supported with the nontemporal hint on the target.

This adds two new functions:
bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;

to TTI, leaving the target independent default implementation as
returning true, but with overriding implementations for X86 that
check the legality based on available Subtarget features.

This fixes https://llvm.org/PR40759

Diff Detail

Repository
rL LLVM

Event Timeline

wristow created this revision.May 9 2019, 3:21 PM

One point I want to explicitly raise, is that prior to this proposed change, the current vectorization implementation ignores the nontemporal hint when checking whether a memory op is vectorizable. So conceptually, it's as though the two new routines:
bool isLegalNTStore(Type *DataType, unsigned Alignment)
bool isLegalNTLoad(Type *DataType, unsigned Alignment)
return true for all types and alignments. With that in mind, I've made the (default) target-independent implementations of these functions return true -- it's only the overriding functions on X86 (which is the target PR40759 was reported against) that will ever return false here (so on non X86 targets, this would be an NFC commit). But in looking through the other backends, I believe that none of them support misaligned nontemporal mem ops. So possibly a better default would be to have these new functions return true only if the specified DataType is minimally aligned at Alignment.

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

SSE4A nt-stores can happen with any alignment, and AFAICT without any perf penalty.

wristow marked an inline comment as done.May 28 2019, 11:14 AM

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

Glad to see your comment about SSE4A supporting nt-stores at any alignment. With that, I can make an X86 test-case that does vectorize.

lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

I didn't realize that. I'll update the patch, and include a test for it.

wristow marked an inline comment as done.May 28 2019, 4:47 PM
wristow added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

Looking into this, I'm confused... Are you saying (for example) that with SSE4A, vmovntps can do an nt-store with a misaligned address? Looking through the docs, I'm seeing a requirement for the address to be aligned.

Or are you saying (for example) the SSE4A instruction movntss (which takes a vector-register operand containing the value to be stored) can take an arbitrary alignment for the memory address? If that's what your point is, then yes I should change the above to allow misaligned float and double nt-stores. But movntss is only storing one float element of the vector register (ignoring the other elems), and so it doesn't allow us to vectorize the case. In short, yes I should change that for float and double nt-stores on SSE4A, but since it doesn't allow us to vectorize, I wonder if I'm misunderstanding your point.

Or are you saying something else? (Like I said, I'm confused.)

wristow updated this revision to Diff 202378.May 31 2019, 12:54 AM

Updated the patch to allow arbitrary alignment of float and double nt-stores for SSE4A.

wristow marked an inline comment as done.May 31 2019, 1:04 AM

Would it be possible to add tests where non-temporal load/stores successfully vectorize?

Glad to see your comment about SSE4A supporting nt-stores at any alignment. With that, I can make an X86 test-case that does vectorize.

Actually, if I understand your SSE4A point correctly, then that doesn't allow vector nt-stores at abrbitrary alignment. So AFAIK, there aren't any vector nt mem-ops on X86, and so for X86, I cannot make a test that successfully vectorizes.

lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

I'm thinking your point must be my second guess above (that movntss and movntsd can store float/double non-temporally at an arbitrary boundary). So I've updated the patch to do that.

RKSimon added inline comments.May 31 2019, 1:18 AM
lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

Sorry @wristow I missed your previous question - yes I was referring to SSE4A allowing unaligned scalar float/double nt-stores. Regular (v)movntps still has natural alignment requirements.

I also raised PR42026 about using movntss/movntsd/movnti to scalarize unaligned vectors, which I think with suitable costs would still allow us to vectorize everything else - vectorizer would create a unaligned vector ntstore ir instruction and we'd scalarize the store in the backend.

wristow marked an inline comment as done.May 31 2019, 9:23 AM
wristow added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

Thanks for that explanation, @RKSimon . I understand much better now.

With that, is there still an additional test (for X86) where non-temporal loads/stores are vectorized that is possible? I think PR42026 is essentially attacking that problem from the other end. Thinking about it, would a fix for PR42026 obviate the change here?

RKSimon added inline comments.Jun 4 2019, 9:48 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
227 ↗(On Diff #202378)

I realise this is the current default but its almost certainly better to get this return false - @fhahn any comments?

lib/Target/X86/X86TargetTransformInfo.cpp
3157 ↗(On Diff #202378)

typo

3080 ↗(On Diff #198918)

I think getting this patch in first makes sense.

I see PR42026 more as a failsafe if anything has managed to create unaligned nt-stores - it doesn't help with nt-loads either.

andreadb accepted this revision.Jun 4 2019, 9:50 AM

I think this patch looks good.
The new TTI hooks looks good, and the change seems conservative enough. But more importantly it fixes the perf issue reported as PR40759.

I'll leave the final decision to Simon. However, from my point of view this patch looks good.

lib/Target/X86/X86TargetTransformInfo.cpp
3080 ↗(On Diff #198918)

My understanding is that we still want this change regardless of PR42026.
I was chatting with Simon about this issue. A fix for PR42026 can be seen as some sort of "last resort" if badly aligned NT instructions reach ISel.

This revision is now accepted and ready to land.Jun 4 2019, 9:50 AM
wristow marked 3 inline comments as done.Jun 4 2019, 10:52 AM
wristow added inline comments.
include/llvm/Analysis/TargetTransformInfoImpl.h
227 ↗(On Diff #202378)

FWIW, I have a similar view. I was thinking of rather than returning simply true (or false), returning true only if the specified DataType is minimally aligned at Alignment. In practice for now (since this is currently only called when checking for vectorization), this will effectively mean it defaults to returning false. But if these routines were ever used more generally, checking the alignment of DataType seems like a sensible default. I'll wait a bit to see whether @fhahn has a preference.

lib/Target/X86/X86TargetTransformInfo.cpp
3157 ↗(On Diff #202378)

Thanks. Will fix before committing.

3080 ↗(On Diff #198918)

OK. Sounds good.

This patch was accepted by @andreadb about a week ago, but Andrea said he'd leave the final decision to @RKSimon. With Simon's comments from a week ago asking @fhahn if he had any preferences for keeping the current default as-is, or changing it to something that seems to reflect existing hardware better, I've been hesitant to actually commit it, even though it's marked as Accepted. With a week now gone by, I'll plan on committing this tomorrow, unless Simon or Florian raise any concerns.

fhahn added a comment.Jun 11 2019, 1:12 PM

Sorry I totally missed this review. Overal LGTM, just a few minor comments inline.

include/llvm/Analysis/TargetTransformInfoImpl.h
227 ↗(On Diff #202378)

I agree, I think we should make non-temporal support opt in.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
755 ↗(On Diff #202378)

Can we drop the } else here, as we return in the if () case?

757 ↗(On Diff #202378)

I guess we also want to extend the cost model to ensure the vectorization factor we pick supports NT loads/stores, as a follow-up?

wristow marked 3 inline comments as done.Jun 11 2019, 2:23 PM

Thanks @fhahn. I'll update the patch to address those comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
227 ↗(On Diff #202378)

Sounds good. I suspect that change will require making some test-changes to be compatible with the more conservative default. Assuming it does, I'll update the patch here for any further comments, rather than directly committing. FTR, I'm going to use my intermediate suggestion of checking the whether DataType is minimally aligned at Alignment (rather than returning false), under the assumption that many architectures do support aligned non-temporal references, and if these new utilities were ever used more generically (outside of vectorization checks), then it would be more meaningful to return true in those aligned cases.

lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
755 ↗(On Diff #202378)

Good point. Will do.

757 ↗(On Diff #202378)

That seems like a good idea, but I'm not sure how we'll exercise that code with current architectures we support. That is, I think none of our architectures have support for any misaligned non-temporal vector memory references. So in practice, vectorization in these cases will always be suppressed (and so the arbitrary selection of 2 vector elements would behave identically if, for example, we arbitrarily picked 4 or 8).

wristow updated this revision to Diff 204235.Jun 12 2019, 1:45 AM

Updated patch to address comments from @RKSimon cand @fhahn -- primarily change the default so that nontemporal misaligned mem-ops are assumed to not exist.

wristow marked 4 inline comments as done.Jun 12 2019, 1:46 AM
RKSimon accepted this revision.Jun 12 2019, 2:53 AM

Thanks @wristow LGTM with one minor

include/llvm/Analysis/TargetTransformInfoImpl.h
240 ↗(On Diff #204235)

Might be safer to duplicate the isLegalNTStore code depending on how well targets override both/either of the calls?

wristow marked an inline comment as done.Jun 13 2019, 7:35 PM
wristow added inline comments.
include/llvm/Analysis/TargetTransformInfoImpl.h
240 ↗(On Diff #204235)

Will do. (I'm on vacation right now, but will wrap this up first hing next week.)

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2019, 10:16 AM