In @dot3_float3 we know that the pointer has sufficient dereferenceable bytes, but the load itself is not sufficiently aligned, and alignment is currently the only number used to decide whether performing the wide load is legal.
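To make the problem concrete, here is a minimal sketch of the idea, not the actual patch: LD and DAG mirror the code quoted later in this thread, WidenBytes is the widened size in bytes, and the alignment-only predicate in the comment is a deliberate simplification of what the legalizer effectively does today.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Roughly, the existing code only trusts the alignment:
//   bool CanWiden = LD->getAlignment() >= WidenBytes;
// The idea is to also accept the widening when the pointer is known to be
// dereferenceable for the full widened width, a query MachinePointerInfo
// already provides.
static bool canWidenLoad(SelectionDAG &DAG, LoadSDNode *LD,
                         unsigned WidenBytes) {
  if (LD->getAlignment() >= WidenBytes)
    return true;
  return LD->getPointerInfo().isDereferenceable(WidenBytes, *DAG.getContext(),
                                                DAG.getDataLayout());
}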
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5286
Is this a trick that turns an unaligned memory access into an aligned one when the widened bits are dereferenceable?
Thanks for taking a look!
Make it more obvious that it's not really an alignment, but a count of known-dereferenceable bytes,
and fix another pessimization in the process.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5286
Note that LdAlign is *not* used as the alignment for the new load. Hopefully this is better.
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
This looks alarming. It's loading 32 bytes instead of 8.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5346
Do we have to reduce NumDereferenceableBytes in the loop?
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
Could you please be more specific: do you believe this is a correctness concern, or a performance one?

Legalizing node: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Analyzing result type: v10i32
Widen node result 0: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
NumDereferenceableBytes 40
Creating new node: t86: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Creating constant: t87: i64 = Constant<32>
Creating new node: t88: i64 = add nuw t28, Constant:i64<32>
Creating new node: t89: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1 + 32, align 4, addrspace 4)> t0, t88, undef:i64
Creating new node: t90: v16i32 = concat_vectors t86, t89
Creating new node: t91: ch = TokenFactor t86:1, t89:1

*** IR Dump Before Module Verifier (verify) *** (function: test_copy_v4i8_x4)
; ModuleID = '/tmp/test.ll'
source_filename = "/tmp/test.ll"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

; Function Attrs: nounwind
define amdgpu_kernel void @test_copy_v4i8_x4(<4 x i8> addrspace(1)* %out0, <4 x i8> addrspace(1)* %out1, <4 x i8> addrspace(1)* %out2, <4 x i8> addrspace(1)* %out3, <4 x i8> addrspace(1)* %in) #0 {
  %test_copy_v4i8_x4.kernarg.segment = call nonnull align 16 dereferenceable(76) i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
  %out0.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 36
  %out0.kernarg.offset.cast = bitcast i8 addrspace(4)* %out0.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %1 = bitcast <4 x i8> addrspace(1)* addrspace(4)* %out0.kernarg.offset.cast to <5 x i64> addrspace(4)*, !amdgpu.uniform !0
  %2 = load <5 x i64>, <5 x i64> addrspace(4)* %1, align 4, !invariant.load !0
  %out0.load1 = extractelement <5 x i64> %2, i32 0
  %3 = inttoptr i64 %out0.load1 to <4 x i8> addrspace(1)*
  %out1.load2 = extractelement <5 x i64> %2, i32 1
  %4 = inttoptr i64 %out1.load2 to <4 x i8> addrspace(1)*
  %out2.load3 = extractelement <5 x i64> %2, i32 2
  %5 = inttoptr i64 %out2.load3 to <4 x i8> addrspace(1)*
  %out3.load4 = extractelement <5 x i64> %2, i32 3
  %6 = inttoptr i64 %out3.load4 to <4 x i8> addrspace(1)*
  %in.load5 = extractelement <5 x i64> %2, i32 4
  %7 = inttoptr i64 %in.load5 to <4 x i8> addrspace(1)*
  %out1.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 44
  %out1.kernarg.offset.cast = bitcast i8 addrspace(4)* %out1.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out2.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 52
  %out2.kernarg.offset.cast = bitcast i8 addrspace(4)* %out2.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out3.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 60
  %out3.kernarg.offset.cast = bitcast i8 addrspace(4)* %out3.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %in.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 68
  %in.kernarg.offset.cast = bitcast i8 addrspace(4)* %in.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %tid.x = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
  %idxprom = sext i32 %tid.x to i64
  %gep = getelementptr <4 x i8>, <4 x i8> addrspace(1)* %7, i64 %idxprom
  %val = load <4 x i8>, <4 x i8> addrspace(1)* %gep, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %3, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %4, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %5, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %6, align 4
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr() #2

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.if.i64(i1) #3

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.else.i64.i64(i64) #3

; Function Attrs: convergent nounwind readnone willreturn
declare i64 @llvm.amdgcn.if.break.i64(i1, i64) #4

; Function Attrs: convergent nounwind willreturn
declare i1 @llvm.amdgcn.loop.i64(i64) #3

; Function Attrs: convergent nounwind willreturn
declare void @llvm.amdgcn.end.cf.i64(i64) #3

attributes #0 = { nounwind "amdgpu-memory-bound"="true" "amdgpu-wave-limiter"="true" "target-cpu"="tahiti" }
attributes #1 = { nounwind readnone speculatable willreturn "target-cpu"="tahiti" }
attributes #2 = { nounwind readnone speculatable willreturn }
attributes #3 = { convergent nounwind willreturn }
attributes #4 = { convergent nounwind readnone willreturn }

!0 = !{}
!1 = !{i32 0, i32 1024}
Don't forget to decrease the dereferenceable size in the loop.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5346
Oh, hmm, yes, I do believe we do indeed, good spot!
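To make the bookkeeping explicit, here is a small, self-contained illustration of the fix being discussed (the names and the chunk-selection rule are invented for the example; the real code works on SDNodes and legal memory types): once a chunk has been emitted and the pointer advanced, the count of bytes still known to be dereferenceable from the new pointer has to shrink by the same amount.

#include <algorithm>
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the widening chunk loop.
void planChunks(unsigned TotalBytes, unsigned DerefBytes,
                unsigned MaxChunkBytes) {
  assert(DerefBytes >= TotalBytes && "the original load must itself be legal");
  unsigned Offset = 0;
  while (TotalBytes > 0) {
    // Stand-in for FindMemType: the widest chunk covered by the remaining
    // dereferenceable budget. It may be wider than what is left to load
    // (that is the point of widening), but never wider than that budget.
    unsigned Chunk = std::min(DerefBytes, MaxChunkBytes);
    std::printf("load %u bytes at offset %u\n", Chunk, Offset);
    Offset += Chunk;
    TotalBytes -= std::min(TotalBytes, Chunk);
    DerefBytes -= Chunk; // the decrement this review comment asks for
  }
}

For example, planChunks(40, 40, 32) emits a 32-byte chunk at offset 0 followed by an 8-byte chunk at offset 32; without the DerefBytes decrement, the second chunk could again be chosen as 32 bytes and read past the 40 bytes that were actually proven dereferenceable.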
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
I meant it looks like an alarming performance regression. I have no reason to believe it's incorrect.
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
The problem you're going to hit is that FindMemType searches for the biggest possible valid type, and the dereferenceable change has relaxed that validity check even more. You're probably going to have to get FindMemType to continue searching the smaller types until it fails a second validity test?
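A hedged sketch of that suggestion (the helper and its signature are invented; the real FindMemType works on EVTs and TargetLowering queries): walk the candidate widths from widest to narrowest until one passes both the dereferenceability check and a second, target-specific validity test.

#include <vector>

// Hypothetical helper; candidate sizes are in bytes, widest first.
unsigned findWidestAcceptable(const std::vector<unsigned> &CandidatesWidestFirst,
                              unsigned DerefBytes,
                              bool (*PassesSecondCheck)(unsigned Bytes)) {
  for (unsigned Bytes : CandidatesWidestFirst) {
    if (Bytes > DerefBytes)
      continue; // first test: every byte we'd touch must be dereferenceable
    if (!PassesSecondCheck(Bytes))
      continue; // second test: e.g. "don't grow the load too much"
    return Bytes; // widest candidate passing both tests
  }
  return 0; // caller falls back to splitting into element-sized loads
}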
llvm/test/CodeGen/X86/widen_load-3.ll:118 (On Diff #360497)
How are we proving this transform is legal?
Attempt to fix issues.
Unless I'm doing it wrong, this seems to have fixed the regressions,
and maybe fixed a few existing miscompiles?
Ok, now *this* should be it.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5330
This is rather horrible :(
llvm/test/CodeGen/ARM/vector-load.ll:257 (On Diff #360594)
We probably shouldn't be widening volatile loads like this. Not that this particular testcase reflects anything useful, but still.
llvm/test/CodeGen/ARM/vector-load.ll:257 (On Diff #360594)
Indeed. Looks like we discovered this bug at the same time.
Hi,
we talked a bit about this internally and found some potential problems.
There were concerns that the assumption "dereferenceable bytes >= alignment" is incorrect: memory protection may work at 4-byte or even byte granularity, and objects may be over-aligned in memory. In these cases, loading the padding bytes does not work.
Even if it can be proven that enough bytes are dereferenceable, we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 bytes instead of 32). Widening a load places more restrictions on the register allocator, as a larger set of consecutive registers needs to be allocated. And, if a widened load hits more cache lines than before, it also consumes more memory bandwidth.
Thanks for taking a look!
This is FUD.
/// isDereferenceable - Return true if V is always dereferenceable for
/// Offset + Size byte.
bool MachinePointerInfo::isDereferenceable(<...>

/// Returns true if V is always dereferenceable for Size byte with alignment
/// greater or equal than requested. If the context instruction is specified
/// performs context-sensitive analysis and returns true if the pointer is
/// dereferenceable at the specified instruction.
bool isDereferenceableAndAlignedPointer(<...>

/// Check if executing a load of this pointer value cannot trap.
///
/// If DT and ScanFrom are specified this method performs context-sensitive
/// analysis and returns true if it is safe to load immediately before ScanFrom.
///
/// If it is not obviously safe to load from the specified pointer, we do
/// a quick local scan of the basic block containing \c ScanFrom, to determine
/// if the address is already accessed.
///
/// This uses the pointee type to determine how many bytes need to be safe to
/// load from the pointer.
bool llvm::isSafeToLoadUnconditionally(Value *V, Align Alignment, APInt &Size,
                                       const DataLayout &DL,
                                       Instruction *ScanFrom,
                                       const DominatorTree *DT,
                                       const TargetLibraryInfo *TLI) {
  // If DT is not specified we can't make context-sensitive query
  const Instruction* CtxI = DT ? ScanFrom : nullptr;
  if (isDereferenceableAndAlignedPointer(V, Alignment, Size, DL, CtxI, DT, TLI))
    return true;
We cannot second-guess LLVM semantics.
If we can tell that we are always allowed to load N bytes without trapping, then we can do that.
If we actually can't, then the problem is elsewhere: we shouldn't have deduced that we can.
> we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 bytes instead of 32). Widening a load places more restrictions on the register allocator, as a larger set of consecutive registers needs to be allocated. And, if a widened load hits more cache lines than before, it also consumes more memory bandwidth.
Ok, so we need a target hook, that's doable.
I agree with that; I misunderstood the code then.
if (LD->isSimple()) {
  NumDereferenceableBytes = LD->getAlignment();
  if (!LdWidth.isScalable())
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, LdWidth / 8);
  if (!WidenWidth.isScalable() && NumDereferenceableBytes < WidenWidth / 8 &&
      LD->getPointerInfo().isDereferenceable(
          WidenWidth / 8, *DAG.getContext(), DAG.getDataLayout()))
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, WidenWidth / 8);
}
As far as I understand this part, NumDereferenceableBytes is set to the (guaranteed minimum) alignment of the load.
In some cases, where it’s ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.
So, NumDereferenceableBytes is always at least the alignment?
isDereferenceable(LD->getAlignment()) is never checked, as far as I can see.
CC @nlopes @aqjune - is alive2 correct for LLVM IR semantics?
That is true, and please note that it is already what is happening regardless of this patch, as you can see from the LHS of the diff.
Looks like that was originally added by rL94338.
Now that I look: indeed, while deref is the source of truth (https://alive2.llvm.org/ce/z/9TAAwj vs https://alive2.llvm.org/ce/z/bmKMm4),
as per the LLVM IR semantics as modelled by alive2, alignment does not guarantee dereferenceability: https://alive2.llvm.org/ce/z/-HxoLA
(CC @nlopes @aqjune)
> In some cases, where it's ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.
> So, NumDereferenceableBytes is always at least the alignment?
> isDereferenceable(LD->getAlignment()) is never checked, as far as I can see.
Right, but that is so regardless of this patch.
@lebedev.ri Yep, Alive2's encoding assumes that dereferenceable has no relation to the align value.
LangRef does not seem to impose that a pointer must be dereferenceable by at least its alignment.
Hmm, that's interesting.
So how does the current existing transform motivate the legality of what it does, I wonder?
I guess the motivation was that the assumption was valid for certain architectures (I'm not an expert in this area, though).
But I agree that such information is valuable not only in codegen but also for middle-end transformations (e.g. the Attributor).
Alignment implying lower granularity of "dereferenceable" isn't justified in LangRef anywhere. In practice, it works out on common targets because it's impossible to allocate less than a page of memory. (At least, in userspace...)
It's something that would be nice to preserve in the cases where it's justified, but maybe not worth the complexity of trying to track the relevant page size on every target.
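If one did want to keep the alignment-based widening where it is justified, the reasoning above could be captured along these lines (a sketch under the stated assumption; MinAllocationGranularity is an invented parameter standing in for "the smallest unit the target can ever allocate or protect", e.g. a page in userspace):

#include <algorithm>
#include <cstdint>

// Bytes known dereferenceable from a pointer that is AlignBytes-aligned and
// dereferenceable for at least KnownDerefBytes, assuming nothing smaller than
// MinAllocationGranularity bytes can ever be allocated or protected.
uint64_t derefImpliedByAlign(uint64_t KnownDerefBytes, uint64_t AlignBytes,
                             uint64_t MinAllocationGranularity) {
  // Under the stated assumption, an AlignBytes-aligned pointer that is
  // dereferenceable at all has the whole [ptr, ptr + AlignBytes) range inside
  // a single minimum-granularity block, so the alignment is a valid lower
  // bound on the dereferenceable size.
  if (KnownDerefBytes > 0 && AlignBytes <= MinAllocationGranularity)
    return std::max(KnownDerefBytes, AlignBytes);
  // Otherwise (byte-granular protection, over-aligned objects, ...) the
  // alignment proves nothing beyond what is already known.
  return KnownDerefBytes;
}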
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
I think this diff is misleading. I've pushed an alternative version here, which is based on first autogenerating the checks in this file before applying D106447: https://github.com/jayfoad/llvm-project/commit/ed4a53da7c265675510caeb2c773f1b461354347
In this particular case (v5i64_arg for SI) the relevant part of the diff is:

-; SI-NEXT: s_load_dwordx8 s[4:11], s[0:1], 0x19
-; SI-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0x9
-; SI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x21
-; SI-NEXT: s_mov_b32 s15, 0xf000
-; SI-NEXT: s_mov_b32 s14, -1
+; SI-NEXT: s_load_dwordx16 s[4:19], s[0:1], 0x19
+; SI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9
+; SI-NEXT: s_mov_b32 s3, 0xf000
+; SI-NEXT: s_mov_b32 s2, -1

I.e. the load from scaled offset 0x9 is untouched; the significant difference is that the dwordx8 and dwordx2 loads from offsets 0x19 and 0x21 have been replaced by a single dwordx16 load from offset 0x19.
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
That's why nobody likes manually-written check lines...
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
Actually, I see that you have not pushed https://github.com/llvm/llvm-project/commit/00b0f1ed7d5833695ba266854ee1d07467ddcf9c, so there isn't much point in rebasing.
Rebased after D107052 (thanks!)
I think the AMDGPU test change looks somewhat less scary now.
Thoughts?
I think we will still need a target hook or some other way of restricting this. In the v5i64_arg case I quoted earlier we're loading 16 dwords instead of 10. Yes the 16-dword load is legal, but that doesn't mean it's a good idea to use it. It increases register pressure (which is especially bad on GPUs) and may load from more cache lines.
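For what it's worth, a hedged sketch of what such a restriction could look like (the hook name, signature, and policy are invented for illustration; no such hook exists in TargetLowering today):

#include <cstdint>

struct LoadWideningPolicy {
  // Should a LoadBytes-wide load be widened to WidenedBytes, given that the
  // extra bytes are already known to be dereferenceable?
  virtual bool isLoadWideningProfitable(uint64_t LoadBytes,
                                        uint64_t WidenedBytes) const {
    return true; // current behaviour: widen whenever it is legal
  }
  virtual ~LoadWideningPolicy() = default;
};

// A GPU-like target might cap the growth, to avoid tying up a larger block of
// consecutive registers and touching extra cache lines.
struct GpuLikeWideningPolicy : LoadWideningPolicy {
  bool isLoadWideningProfitable(uint64_t LoadBytes,
                                uint64_t WidenedBytes) const override {
    return WidenedBytes - LoadBytes <= 8; // illustrative threshold only
  }
};

Under a policy like that, the v5i64_arg case (widening a 40-byte load to 64 bytes) would be rejected, while small round-ups would still be allowed.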
clang-tidy: warning: invalid case style for function 'FindMemType' [readability-identifier-naming]
not useful