This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Improve handling of non-temporal aligned loads
ClosedPublic

Authored by RKSimon on May 31 2017, 9:29 AM.

Details

Summary

PR32743 - Avoid folding non-temporal aligned loads (on CPUs that support them) into other instructions, which would turn them into ordinary temporal vector loads and pollute the caches.
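To make the first point concrete, here is a reduced example in the spirit of the nontemporal-loads.ll tests (the function name and the instruction sequences in the comments are illustrative assumptions, not quoted from the patch):

  ; Folding the !nontemporal load into the add would emit an ordinary
  ; temporal load, e.g. "paddd (%rdi), %xmm0", silently dropping the
  ; non-temporal hint. Keeping the load as a separate instruction lets
  ; isel select MOVNTDQA (available from SSE4.1 onwards):
  ;   movntdqa (%rdi), %xmm1
  ;   paddd    %xmm1, %xmm0
  define <4 x i32> @nt_load_add(<4 x i32>* %src, <4 x i32> %x) {
    %v = load <4 x i32>, <4 x i32>* %src, align 16, !nontemporal !0
    %r = add <4 x i32> %v, %x
    ret <4 x i32> %r
  }
  !0 = !{i32 1}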

PR32744 - Split 256-bit vector non-temporal aligned loads on AVX1 targets to keep them non-temporal.
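And a sketch of the second point (again illustrative; the key fact is that the 256-bit VMOVNTDQA requires AVX2, so AVX1 can only stay non-temporal by using two 128-bit loads):

  ; A plain 32-byte load on AVX1 would have to be a temporal vmovaps,
  ; since VMOVNTDQA ymm only exists with AVX2. Splitting preserves the
  ; non-temporal hint:
  ;   vmovntdqa    (%rdi), %xmm0
  ;   vmovntdqa  16(%rdi), %xmm1
  ;   vinsertf128 $1, %xmm1, %ymm0, %ymm0
  define <8 x float> @nt_load_v8f32(<8 x float>* %p) {
    %v = load <8 x float>, <8 x float>* %p, align 32, !nontemporal !0
    ret <8 x float> %v
  }
  !0 = !{i32 1}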

These could be committed separately, but they are so interrelated that I thought it better to get them reviewed together.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. May 31 2017, 9:29 AM
zvi edited edge metadata. Jun 4 2017, 1:03 PM

LGTM

filcab edited edge metadata. Jun 5 2017, 7:07 AM

LGTM with a minor comment nit (if I'm right).
The code expansion is annoying, but it keeps the output closer to the source semantics.
Thanks!
Filipe

lib/Target/X86/X86ISelLowering.cpp
32303 (On Diff #100877)

"pre-AVX2" (or "targets without AVX2"), no? I'd expect this to also happen on SSE4.1 (which also has 128-bit NT loads).

RKSimon added inline comments. Jun 5 2017, 7:48 AM
lib/Target/X86/X86ISelLowering.cpp
32303 (On Diff #100877)

Yes, it's just the comment that needs fixing; !hasInt256() already handles all SSE/AVX1 CPUs.
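For context: MOVNTDQA only requires SSE4.1, which is why the guard matters beyond AVX1. A hypothetical test in the style of nontemporal-loads.ll follows; the CHECK lines are the expected codegen under this change, not copied from the committed test:

  ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s
  ; A 256-bit non-temporal load is legalized to two 128-bit halves on
  ; SSE4.1, and each half should stay non-temporal:
  ; CHECK: movntdqa (%rdi), %xmm0
  ; CHECK: movntdqa 16(%rdi), %xmm1
  define <8 x i32> @nt_load_v8i32(<8 x i32>* %p) {
    %v = load <8 x i32>, <8 x i32>* %p, align 32, !nontemporal !0
    ret <8 x i32> %v
  }
  !0 = !{i32 1}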

This revision was automatically updated to reflect the committed changes.
craig.topper added inline comments. Jun 5 2017, 8:53 AM
test/CodeGen/X86/nontemporal-loads.ll
642 (On Diff #100877)

Why is SSE4.1 still folding here? Is this because the SSE4.1 patterns use memopv4f32 and not loadv4f32?

RKSimon added inline comments. Jun 5 2017, 9:04 AM
test/CodeGen/X86/nontemporal-loads.ll
642 (On Diff #100877)

Ah, missed that one. Yes, it's because it's using the 'SSE-only' memory fragments.
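For context, memopv4f32 is the SSE-only pattern fragment that permits folding when the load is 16-byte aligned (or the target has hasSSEUnalignedMem); at this point it does not check the non-temporal flag, so patterns written against it can still fold the load. A hypothetical reduced case illustrating the mechanism (the actual operation at line 642 of the test is not quoted in this archive):

  ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s
  ; The SSE pattern folds via memopv4f32, which checks only alignment,
  ; so the non-temporal hint is dropped:
  ; CHECK: addps (%rdi), %xmm0
  define <4 x float> @nt_load_fadd(<4 x float>* %p, <4 x float> %x) {
    %v = load <4 x float>, <4 x float>* %p, align 16, !nontemporal !0
    %r = fadd <4 x float> %v, %x
    ret <4 x float> %r
  }
  !0 = !{i32 1}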

spatel edited edge metadata. Jun 5 2017, 9:11 AM

Note: AVX1 gets more non-temporal coverage with the current proposal in D33866.

craig.topper added inline comments. Jun 5 2017, 9:13 AM
test/CodeGen/X86/nontemporal-loads.ll
642 (On Diff #100877)

I've wondered for a while whether we should have a converged memop that lifts the alignment restriction when AVX is enabled. I think that would shrink the DAG isel table, because AVX and SSE would use the same predicate. The only issue I know of is that the SHA1 instructions use memop but don't have a documented VEX equivalent yet, so they must always obey the alignment even when AVX is enabled.

@craig.topper Yes, it does look like we could start by having memop inherit from vec128load (simplifying the repeated non-temporal logic I'm about to add); there are probably other cases like that as well. Or we could just go for the SSE/AVX refactor straight away?

It's not perfect, but we could get around the SHA issue by only folding alignedloadv2i64 and adding a couple of extra folding patterns to the SHA instructions for the hasSSEUnalignedMem cases. It'd be messy, though...