This is an archive of the discontinued LLVM Phabricator instance.


Differential D74524

[Scheduling] Improve memory ops cluster preparation
Abandoned, Public

Authored by qiucf on Feb 12 2020, 9:47 PM.


Details

Reviewers

fhahn
evandro
arsenm
foad

Group Reviewers

Restricted Project

Summary

SUnits in ScheduleDAGInstrs are divided into groups by the node number of their ctrl pred. The scheduler tries to build cluster edges inside each group. However, some units have no ctrl preds.

This patch:

- Adds these units to each group before clusterNeighboringMemOps, to catch more cluster opportunities.
- Adds a bitmap to mark units that are already clustered, so that they are skipped when cluster candidates are collected.

This impacts several test cases. Comparing the scheduling logs shows that more cluster edges are added and that the clustered units are scheduled together.
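The two steps in the summary can be sketched on a toy model. This is only an illustration under stated assumptions: `Unit`, `groupByCtrlPred`, `mergeFreeUnits` and `unclustered` are hypothetical names, not LLVM APIs, and the real patch works on `SUnit` and a `SmallBitVector` inside `BaseMemOpClusterMutation`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical, simplified stand-in for an SUnit.
struct Unit {
  std::size_t NodeNum;  // index of this unit in the DAG
  bool HasCtrlPred;     // true if a non-artificial control pred exists
  std::size_t CtrlPred; // NodeNum of that pred (ignored when HasCtrlPred is false)
};

using Groups = std::map<std::size_t, std::vector<const Unit *>>;

// Group units by control predecessor. Units without one share the key
// Units.size(), which no real NodeNum can take.
Groups groupByCtrlPred(const std::vector<Unit> &Units) {
  Groups G;
  for (const Unit &U : Units)
    G[U.HasCtrlPred ? U.CtrlPred : Units.size()].push_back(&U);
  return G;
}

// Step 1: append the "free" units (key == FreeKey) to every other group.
void mergeFreeUnits(Groups &G, std::size_t FreeKey) {
  auto FreeIt = G.find(FreeKey);
  if (FreeIt == G.end())
    return;
  const std::vector<const Unit *> Free = FreeIt->second; // copy before appending
  for (auto &Entry : G)
    if (Entry.first != FreeKey)
      Entry.second.insert(Entry.second.end(), Free.begin(), Free.end());
}

// Step 2: drop units that an earlier group has already clustered.
std::vector<const Unit *> unclustered(const std::vector<const Unit *> &Group,
                                      const std::vector<bool> &Clustered) {
  std::vector<const Unit *> Out;
  for (const Unit *U : Group)
    if (!Clustered[U->NodeNum])
      Out.push_back(U);
  return Out;
}
```

The bitmap matters because a free unit now appears in several groups; without it, the same unit could be offered for clustering more than once.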

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	10 ms	LLVM-Unit.Frontend/_/LLVMFrontendTests::Unknown Unit Message ("")
	10 ms	LLVM.Bindings/Go::Unknown Unit Message ("")
	190 ms	LLVM.CodeGen/AMDGPU::Unknown Unit Message ("")

Event Timeline

qiucf created this revision. · Feb 12 2020, 9:47 PM

Herald added a project: Restricted Project. · Feb 12 2020, 9:47 PM

Herald added subscribers: llvm-commits, kerbowa, arphaman and 6 others.

qiucf edited the summary of this revision. · Feb 12 2020, 9:54 PM

Harbormaster failed remote builds in B46379: Diff 244333! · Feb 12 2020, 10:13 PM

kerbowa added subscribers: rampitec, foad. · Feb 12 2020, 10:22 PM

Fix build failure

rampitec added a reviewer: foad. · Feb 12 2020, 10:51 PM

Harbormaster failed remote builds in B46383: Diff 244339! · Feb 12 2020, 10:53 PM

rampitec added inline comments. · Feb 12 2020, 11:11 PM

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
494	This is a regression I guess. A memory operation should always go before an independent ALU as it has higher latency.
524	And then loads should preferably go before stores. Loads have higher latency and their results need to be consumed by some other instruction. So this looks like a regression to me too.
llvm/test/CodeGen/AMDGPU/captured-frame-index.ll
55	And then this is a progression, as two stores are scheduled together. It would be nice to understand if they were clustered or is it a coincidence.
71	Same here. It seems to help store clustering.
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
61	Stores are clustered again, here and below, which is a nice improvement.

foad added inline comments. · Feb 13 2020, 1:24 AM

llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll
12	This looks suspicious. Why has it stopped clustering the two loads from x0, and the two loads from x1?

Thanks for working on this. I have to admit I have never understood the store chain stuff in BaseMemOpClusterMutation::apply. Can you point me at any documentation that explains what a store chain is, and why it's based on control dependence, and why it's useful for clustering?

Is this approach generally sound? I worry that you may get circular dependencies, e.g.: A has no control dependency, but B has a control dependency on A (possibly indirectly). Clustering introduces new artificial dependencies, which could potentially lead to A becoming dependent on an SUnit C that depends on B, creating a circular dependency.

qiucf marked 2 inline comments as done. · Feb 16 2020, 7:06 PM

qiucf added inline comments.

llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll
12	AArch64's `shouldClusterMemOps` requires the offsets to be `N` and `N+1`; here they are `2` and `6`, so the loads are not clustered in the log. The scheduler does have some clustering issues, though. It doesn't know that `SU(13)` and `SU(15)` can't be scheduled together (`SU(14)` must come between them). This can be fixed by not creating the edge when `SUa` and `SUb` can't be scheduled together. We can do that in future patches.
llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
494	Yes. This regression and the other one in this file will be eliminated if we prevent a cluster edge from being created when the two units can't be scheduled together, as described above.
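The `N` and `N+1` condition mentioned for overeager_mla_fusing.ll can be illustrated with a small sketch. It assumes the offsets are expressed in units of the access size, and that a `q` register load accesses 16 bytes, which is consistent with the `2` and `6` quoted above for byte offsets 32 and 96. `scaledOffset` and `offsetsAreAdjacent` are hypothetical helpers, not the AArch64 backend's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: express a byte offset in units of the access size.
// Assumes AccessBytes > 0 and that ByteOffset is a multiple of it.
std::int64_t scaledOffset(std::int64_t ByteOffset, std::int64_t AccessBytes) {
  return ByteOffset / AccessBytes;
}

// Hypothetical adjacency rule: the second scaled offset must be exactly
// one past the first (N and N+1).
bool offsetsAreAdjacent(std::int64_t Scaled1, std::int64_t Scaled2) {
  return Scaled1 + 1 == Scaled2;
}
```

With `ldr q0, [x1, #32]` and `ldr q1, [x1, #96]`, the scaled offsets are 2 and 6, so a rule of this shape would reject the pair.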

In D74524#1873828, @foad wrote:

Thanks for working on this. I have to admit I have never understood the store chain stuff in BaseMemOpClusterMutation::apply. Can you point me at any documentation that explains what a store chain is, and why it's based on control dependence, and why it's useful for clustering?

Per my understanding (maybe not correct), ‘chains’ are used to describe some kinds of dependency other than use-def, such as memory operations. (This discussion is a good reference) So it’s natural for the chains to become anti, output, or order dependencies.

As for why it's useful for clustering, I guess what apply does is a cheap but imperfect way to eliminate 'impossible' cluster pairs (like SU(3) clustered with SU(5) when SU(4) is a barrier): if two units have the same pred-dep, they are likely to be able to sit next to each other. In fact, when I roughly changed if (Pred.isCtrl() && !Pred.isArtificial()) to if (Pred.isNormalMemoryOrBarrier()), all check-llvm tests passed. Since this method was written by @atrick several years ago and its core logic hasn't changed, it's sometimes confusing to me too :) There are some other issues with this implementation; nevertheless, it may save compile time.
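To make the difference between the two predicates in that experiment concrete, here is a toy model. It is a sketch built on an assumption about how the `SDep` queries behave (`isCtrl()` meaning "not a data dependence", `isNormalMemoryOrBarrier()` meaning an order dependence of a memory-alias or barrier kind); `Dep`, `originalFilter` and `narrowerFilter` are hypothetical names, and the real definitions should be checked in ScheduleDAG.h.

```cpp
#include <cassert>

// Simplified model of a dependence edge. The enumerator names follow LLVM's
// SDep, but this struct is a hypothetical stand-in, not the real class.
enum class Kind { Data, Anti, Output, Order };
enum class OrderKind { Barrier, MayAliasMem, MustAliasMem, Artificial, Weak, Cluster };

struct Dep {
  Kind K;
  OrderKind O; // only meaningful when K == Kind::Order

  bool isCtrl() const { return K != Kind::Data; }
  bool isArtificial() const {
    return K == Kind::Order && O == OrderKind::Artificial;
  }
  bool isNormalMemoryOrBarrier() const {
    return K == Kind::Order &&
           (O == OrderKind::MayAliasMem || O == OrderKind::MustAliasMem ||
            O == OrderKind::Barrier);
  }
};

// The condition currently used to pick the chain predecessor.
bool originalFilter(const Dep &D) { return D.isCtrl() && !D.isArtificial(); }

// The narrower condition tried in the experiment described above.
bool narrowerFilter(const Dep &D) { return D.isNormalMemoryOrBarrier(); }
```

Under this model, anti and output dependencies pass the original filter but not the narrower one, while memory and barrier order dependencies pass both.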

In D74524#1873912, @nhaehnle wrote:

Is this approach generally sound? I worry that you may get circular dependencies, e.g.: A has no control dependency, but B has a control dependency on A (possibly indirectly). Clustering introduces new artificial dependencies, which could potentially lead to A becoming dependent on an SUnit C that depends on B, creating a circular dependency.

Currently (before D72031 lands), after cluster edges are created, only the succ-deps of A (assume that's C) will depend on B. For a circular dependency to happen, B has to depend on C directly, or depend on D which depends on C. But we call Topo.IsReachable before adding edges, so that would be detected. (A can't depend on C because C depends on A.) This doesn't seem related to whether A or B has control dependencies. Am I right?

+->B+---->A<-+
|  +         |
|  +>D+      |
|     |      |
|     v      |
+----+C+-----+
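The argument above relies on a reachability check before each new edge. A minimal sketch of that idea follows, using a plain depth-first search on a hypothetical `MiniDAG`; the scheduler's own check uses a maintained topological ordering rather than this search, so treat this only as an illustration of the invariant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature DAG: Succs[N] lists the nodes that depend on N.
struct MiniDAG {
  std::vector<std::vector<std::size_t>> Succs;
  explicit MiniDAG(std::size_t N) : Succs(N) {}

  // True if To can be reached from From by following successor edges.
  bool isReachable(std::size_t From, std::size_t To) const {
    std::vector<bool> Seen(Succs.size(), false);
    std::vector<std::size_t> Work{From};
    while (!Work.empty()) {
      std::size_t N = Work.back();
      Work.pop_back();
      if (N == To)
        return true;
      if (Seen[N])
        continue;
      Seen[N] = true;
      for (std::size_t S : Succs[N])
        Work.push_back(S);
    }
    return false;
  }

  // Add Pred -> Succ unless Pred is already reachable from Succ, which
  // would close a cycle. Returns whether the edge was added.
  bool addEdge(std::size_t Pred, std::size_t Succ) {
    if (isReachable(Succ, Pred))
      return false;
    Succs[Pred].push_back(Succ);
    return true;
  }
};
```

With A=0, B=1, C=2, D=3 and the existing edges A->C, C->D, D->B, an attempt to make C depend on B is rejected, which is the case drawn in the diagram.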

Update to rebase and reflect new test changes.

Harbormaster failed remote builds in B46882: Diff 245570! · Feb 19 2020, 8:49 PM

I think this patch makes sense. Can you please rebase it and double-check all the test changes?

steven.zhang mentioned this in D85517: [Scheduling] Implement a new way to cluster loads/stores. · Aug 7 2020, 4:57 AM

I have proposed a new algorithm to do the memory cluster in D85517. Would you please verify if it can solve your problem ...

In D74524#2206176, @steven.zhang wrote:

I have proposed a new algorithm to do the memory cluster in D85517. Would you please verify if it can solve your problem ...

Yes. The patch resolves the fusion issue that originally motivated this change. Thanks for the improvement.

Revision Contents

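Path	Size
llvm/lib/CodeGen/MachineScheduler.cpp	26 lines
llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll	12 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.atomic.inc.ll	14 lines
llvm/test/CodeGen/AMDGPU/bitcast-vector-extract.ll	4 lines
llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll	6 lines
llvm/test/CodeGen/AMDGPU/captured-frame-index.ll	6 lines
llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll	8 lines
llvm/test/CodeGen/AMDGPU/extract_vector_elt-i8.ll	2 lines
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll	71 lines
llvm/test/CodeGen/AMDGPU/global-saddr.ll	2 lines
llvm/test/CodeGen/AMDGPU/sched-assert-onlydbg-value-empty-region.mir	4 lines
llvm/test/CodeGen/AMDGPU/sign_extend.ll	24 lines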

Diff 244333

llvm/lib/CodeGen/MachineScheduler.cpp

Show All 11 Lines
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/CodeGen/MachineScheduler.h"		#include "llvm/CodeGen/MachineScheduler.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/BitVector.h"		#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PriorityQueue.h"		#include "llvm/ADT/PriorityQueue.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/iterator_range.h"		#include "llvm/ADT/iterator_range.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/LiveInterval.h"		#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/CodeGen/LiveIntervals.h"		#include "llvm/CodeGen/LiveIntervals.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineDominators.h"		#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
▲ Show 20 Lines • Show All 1,492 Lines • ▼ Show 20 Lines
public:		public:
BaseMemOpClusterMutation(const TargetInstrInfo *tii,		BaseMemOpClusterMutation(const TargetInstrInfo *tii,
const TargetRegisterInfo *tri, bool IsLoad)		const TargetRegisterInfo *tri, bool IsLoad)
: TII(tii), TRI(tri), IsLoad(IsLoad) {}		: TII(tii), TRI(tri), IsLoad(IsLoad) {}

void apply(ScheduleDAGInstrs *DAGInstrs) override;		void apply(ScheduleDAGInstrs *DAGInstrs) override;

protected:		protected:
void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG);		void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps,
		ScheduleDAGInstrs *DAG,
		SmallBitVector &ClusteredUnits);
};		};

class StoreClusterMutation : public BaseMemOpClusterMutation {		class StoreClusterMutation : public BaseMemOpClusterMutation {
public:		public:
StoreClusterMutation(const TargetInstrInfo *tii,		StoreClusterMutation(const TargetInstrInfo *tii,
const TargetRegisterInfo *tri)		const TargetRegisterInfo *tri)
: BaseMemOpClusterMutation(tii, tri, false) {}		: BaseMemOpClusterMutation(tii, tri, false) {}
};		};
Show All 20 Lines	createStoreClusterDAGMutation(const TargetInstrInfo *TII,
const TargetRegisterInfo *TRI) {		const TargetRegisterInfo *TRI) {
return EnableMemOpCluster ? std::make_unique<StoreClusterMutation>(TII, TRI)		return EnableMemOpCluster ? std::make_unique<StoreClusterMutation>(TII, TRI)
: nullptr;		: nullptr;
}		}

} // end namespace llvm		} // end namespace llvm

void BaseMemOpClusterMutation::clusterNeighboringMemOps(		void BaseMemOpClusterMutation::clusterNeighboringMemOps(
ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG) {		ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG,
		SmallBitVector &ClusteredUnits) {
SmallVector<MemOpInfo, 32> MemOpRecords;		SmallVector<MemOpInfo, 32> MemOpRecords;
for (SUnit *SU : MemOps) {		for (SUnit *SU : MemOps) {
		// Skip those already clustered
		if (ClusteredUnits.test(SU->NodeNum))
		continue;
SmallVector<const MachineOperand *, 4> BaseOps;		SmallVector<const MachineOperand *, 4> BaseOps;
int64_t Offset;		int64_t Offset;
if (TII->getMemOperandsWithOffset(*SU->getInstr(), BaseOps, Offset, TRI))		if (TII->getMemOperandsWithOffset(*SU->getInstr(), BaseOps, Offset, TRI))
MemOpRecords.push_back(MemOpInfo(SU, BaseOps, Offset));		MemOpRecords.push_back(MemOpInfo(SU, BaseOps, Offset));
#ifndef NDEBUG		#ifndef NDEBUG
for (auto *Op : BaseOps)		for (auto *Op : BaseOps)
assert(Op);		assert(Op);
#endif		#endif
Show All 21 Lines	if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
for (const SDep &Succ : SUa->Succs) {		for (const SDep &Succ : SUa->Succs) {
if (Succ.getSUnit() == SUb)		if (Succ.getSUnit() == SUb)
continue;		continue;
LLVM_DEBUG(dbgs()		LLVM_DEBUG(dbgs()
<< " Copy Succ SU(" << Succ.getSUnit()->NodeNum << ")\n");		<< " Copy Succ SU(" << Succ.getSUnit()->NodeNum << ")\n");
DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));		DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
}		}
++ClusterLength;		++ClusterLength;
		ClusteredUnits.set(SUa->NodeNum);
		ClusteredUnits.set(SUb->NodeNum);
} else		} else
ClusterLength = 1;		ClusterLength = 1;
} else		} else
ClusterLength = 1;		ClusterLength = 1;
}		}
}		}

/// Callback from DAG postProcessing to create cluster edges for loads.		/// Callback from DAG postProcessing to create cluster edges for loads.
Show All 12 Lines	for (const SDep &Pred : SU.Preds) {
break;		break;
}		}
}		}
// Insert the SU to corresponding store chain.		// Insert the SU to corresponding store chain.
auto &Chain = StoreChains.FindAndConstruct(ChainPredID).second;		auto &Chain = StoreChains.FindAndConstruct(ChainPredID).second;
Chain.push_back(&SU);		Chain.push_back(&SU);
}		}

// Iterate over the store chains.		// Iterate over the store chains. Each time, insert units without any ctrl
for (auto &SCD : StoreChains)		// preds into other groups.
clusterNeighboringMemOps(SCD.second, DAG);		SmallBitVector ClusteredUnits(DAG->SUnits.size());
		const auto &Free = StoreChains.FindAndConstruct(DAG->SUnits.size()).second;
		for (auto &SCD : StoreChains) {
		if (SCD.first != DAG->SUnits.size())
		for (SUnit *S : Free)
		SCD.second.push_back(S);
		clusterNeighboringMemOps(SCD.second, DAG, ClusteredUnits);
		}
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// CopyConstrain - DAG post-processing to encourage copy elimination.		// CopyConstrain - DAG post-processing to encourage copy elimination.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {

▲ Show 20 Lines • Show All 2,128 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/overeager_mla_fusing.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc %s --mtriple aarch64 -verify-machineinstrs -o - | FileCheck %s			; RUN: llc %s --mtriple aarch64 -verify-machineinstrs -o - | FileCheck %s

	define dso_local void @jsimd_idct_ifast_neon_intrinsic(i8* nocapture readonly %dct_table, i16* nocapture readonly %coef_block, i8** nocapture readonly %output_buf, i32 %output_col) local_unnamed_addr #0 {			define dso_local void @jsimd_idct_ifast_neon_intrinsic(i8* nocapture readonly %dct_table, i16* nocapture readonly %coef_block, i8** nocapture readonly %output_buf, i32 %output_col) local_unnamed_addr #0 {
	; CHECK-LABEL: jsimd_idct_ifast_neon_intrinsic:			; CHECK-LABEL: jsimd_idct_ifast_neon_intrinsic:
	; CHECK: // %bb.0: // %entry			; CHECK: .Ljsimd_idct_ifast_neon_intrinsic$local:
				; CHECK-NEXT: .cfi_startproc
				; CHECK-NEXT: // %bb.0: // %entry
	; CHECK-NEXT: ldr q0, [x1, #32]			; CHECK-NEXT: ldr q0, [x1, #32]
	; CHECK-NEXT: ldr q1, [x1, #96]			; CHECK-NEXT: ldr q1, [x0, #32]
	; CHECK-NEXT: ldr q2, [x0, #32]			; CHECK-NEXT: ldr q2, [x1, #96]
	; CHECK-NEXT: ldr q3, [x0, #96]			; CHECK-NEXT: ldr q3, [x0, #96]
	; CHECK-NEXT: ldr x8, [x2, #48]			; CHECK-NEXT: ldr x8, [x2, #48]
				; CHECK-NEXT: mul v0.8h, v1.8h, v0.8h
	; CHECK-NEXT: mov w9, w3			; CHECK-NEXT: mov w9, w3
	; CHECK-NEXT: mul v0.8h, v2.8h, v0.8h			; CHECK-NEXT: mul v1.8h, v3.8h, v2.8h
	; CHECK-NEXT: mul v1.8h, v3.8h, v1.8h
	; CHECK-NEXT: add v2.8h, v0.8h, v1.8h			; CHECK-NEXT: add v2.8h, v0.8h, v1.8h
	; CHECK-NEXT: str q2, [x8, x9]			; CHECK-NEXT: str q2, [x8, x9]
	; CHECK-NEXT: ldr x8, [x2, #56]			; CHECK-NEXT: ldr x8, [x2, #56]
	; CHECK-NEXT: sub v0.8h, v0.8h, v1.8h			; CHECK-NEXT: sub v0.8h, v0.8h, v1.8h
	; CHECK-NEXT: str q0, [x8, x9]			; CHECK-NEXT: str q0, [x8, x9]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	%add.ptr5 = getelementptr inbounds i16, i16* %coef_block, i64 16			%add.ptr5 = getelementptr inbounds i16, i16* %coef_block, i64 16
	Show All 37 Lines

llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.atomic.inc.ll

	Show First 20 Lines • Show All 1,589 Lines • ▼ Show 20 Lines
	; VI-NEXT: s_waitcnt lgkmcnt(1)			; VI-NEXT: s_waitcnt lgkmcnt(1)
	; VI-NEXT: flat_store_dword v[0:1], v5			; VI-NEXT: flat_store_dword v[0:1], v5
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: nocse_lds_atomic_inc_ret_i32:			; GFX9-LABEL: nocse_lds_atomic_inc_ret_i32:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; GFX9-NEXT: s_load_dword s4, s[4:5], 0x10			; GFX9-NEXT: s_load_dword s4, s[4:5], 0x10
	; GFX9-NEXT: v_mov_b32_e32 v0, 42			; GFX9-NEXT: v_mov_b32_e32 v4, 42
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: v_mov_b32_e32 v2, s2
	; GFX9-NEXT: v_mov_b32_e32 v1, s4
	; GFX9-NEXT: ds_inc_rtn_u32 v4, v1, v0
	; GFX9-NEXT: ds_inc_rtn_u32 v5, v1, v0
	; GFX9-NEXT: v_mov_b32_e32 v0, s0			; GFX9-NEXT: v_mov_b32_e32 v0, s0
				; GFX9-NEXT: v_mov_b32_e32 v5, s4
				; GFX9-NEXT: ds_inc_rtn_u32 v6, v5, v4
				; GFX9-NEXT: ds_inc_rtn_u32 v4, v5, v4
				; GFX9-NEXT: v_mov_b32_e32 v2, s2
	; GFX9-NEXT: v_mov_b32_e32 v1, s1			; GFX9-NEXT: v_mov_b32_e32 v1, s1
	; GFX9-NEXT: v_mov_b32_e32 v3, s3			; GFX9-NEXT: v_mov_b32_e32 v3, s3
	; GFX9-NEXT: s_waitcnt lgkmcnt(1)			; GFX9-NEXT: s_waitcnt lgkmcnt(1)
	; GFX9-NEXT: global_store_dword v[0:1], v4, off			; GFX9-NEXT: global_store_dword v[0:1], v6, off
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_store_dword v[2:3], v5, off			; GFX9-NEXT: global_store_dword v[2:3], v4, off
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	%result0 = call i32 @llvm.amdgcn.atomic.inc.i32.p3i32(i32 addrspace(3)* %ptr, i32 42, i32 0, i32 0, i1 false)			%result0 = call i32 @llvm.amdgcn.atomic.inc.i32.p3i32(i32 addrspace(3)* %ptr, i32 42, i32 0, i32 0, i1 false)
	%result1 = call i32 @llvm.amdgcn.atomic.inc.i32.p3i32(i32 addrspace(3)* %ptr, i32 42, i32 0, i32 0, i1 false)			%result1 = call i32 @llvm.amdgcn.atomic.inc.i32.p3i32(i32 addrspace(3)* %ptr, i32 42, i32 0, i32 0, i1 false)

	store i32 %result0, i32 addrspace(1)* %out0			store i32 %result0, i32 addrspace(1)* %out0
	store i32 %result1, i32 addrspace(1)* %out1			store i32 %result1, i32 addrspace(1)* %out1
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }
	attributes #2 = { nounwind argmemonly }			attributes #2 = { nounwind argmemonly }

llvm/test/CodeGen/AMDGPU/bitcast-vector-extract.ll

	; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s

	; The bitcast should be pushed through the bitcasts so the vectors can			; The bitcast should be pushed through the bitcasts so the vectors can
	; be broken down and the shared components can be CSEd			; be broken down and the shared components can be CSEd

	; GCN-LABEL: {{^}}store_bitcast_constant_v8i32_to_v8f32:			; GCN-LABEL: {{^}}store_bitcast_constant_v8i32_to_v8f32:
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32			; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	define amdgpu_kernel void @store_bitcast_constant_v8i32_to_v8f32(<8 x float> addrspace(1)* %out, <8 x i32> %vec) {			define amdgpu_kernel void @store_bitcast_constant_v8i32_to_v8f32(<8 x float> addrspace(1)* %out, <8 x i32> %vec) {
	%vec0.bc = bitcast <8 x i32> <i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 8> to <8 x float>			%vec0.bc = bitcast <8 x i32> <i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 8> to <8 x float>
	store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out			store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out

	%vec1.bc = bitcast <8 x i32> <i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 9> to <8 x float>			%vec1.bc = bitcast <8 x i32> <i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 7, i32 9> to <8 x float>
	store volatile <8 x float> %vec1.bc, <8 x float> addrspace(1)* %out			store volatile <8 x float> %vec1.bc, <8 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}store_bitcast_constant_v4i64_to_v8f32:			; GCN-LABEL: {{^}}store_bitcast_constant_v4i64_to_v8f32:
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32			; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	define amdgpu_kernel void @store_bitcast_constant_v4i64_to_v8f32(<8 x float> addrspace(1)* %out, <4 x i64> %vec) {			define amdgpu_kernel void @store_bitcast_constant_v4i64_to_v8f32(<8 x float> addrspace(1)* %out, <4 x i64> %vec) {
	%vec0.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 8> to <8 x float>			%vec0.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 8> to <8 x float>
	store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out			store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out

	%vec1.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 9> to <8 x float>			%vec1.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 9> to <8 x float>
	store volatile <8 x float> %vec1.bc, <8 x float> addrspace(1)* %out			store volatile <8 x float> %vec1.bc, <8 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}store_bitcast_constant_v4i64_to_v4f64:			; GCN-LABEL: {{^}}store_bitcast_constant_v4i64_to_v4f64:
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32			; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	define amdgpu_kernel void @store_bitcast_constant_v4i64_to_v4f64(<4 x double> addrspace(1)* %out, <4 x i64> %vec) {			define amdgpu_kernel void @store_bitcast_constant_v4i64_to_v4f64(<4 x double> addrspace(1)* %out, <4 x i64> %vec) {
	%vec0.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 8> to <4 x double>			%vec0.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 8> to <4 x double>
	store volatile <4 x double> %vec0.bc, <4 x double> addrspace(1)* %out			store volatile <4 x double> %vec0.bc, <4 x double> addrspace(1)* %out

	%vec1.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 9> to <4 x double>			%vec1.bc = bitcast <4 x i64> <i64 7, i64 7, i64 7, i64 9> to <4 x double>
	store volatile <4 x double> %vec1.bc, <4 x double> addrspace(1)* %out			store volatile <4 x double> %vec1.bc, <4 x double> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}store_bitcast_constant_v8i32_to_v16i16:			; GCN-LABEL: {{^}}store_bitcast_constant_v8i32_to_v16i16:
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	; GCN-NOT: v_mov_b32			; GCN-NOT: v_mov_b32
	; GCN: buffer_store_dwordx4			; GCN: buffer_store_dwordx4
	define amdgpu_kernel void @store_bitcast_constant_v8i32_to_v16i16(<8 x float> addrspace(1)* %out, <16 x i16> %vec) {			define amdgpu_kernel void @store_bitcast_constant_v8i32_to_v16i16(<8 x float> addrspace(1)* %out, <16 x i16> %vec) {
	%vec0.bc = bitcast <16 x i16> <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 8> to <8 x float>			%vec0.bc = bitcast <16 x i16> <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 8> to <8 x float>
	store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out			store volatile <8 x float> %vec0.bc, <8 x float> addrspace(1)* %out

	%vec1.bc = bitcast <16 x i16> <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 9> to <8 x float>			%vec1.bc = bitcast <16 x i16> <i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 7, i16 9> to <8 x float>
	Show All 27 Lines

llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll

	Show First 20 Lines • Show All 485 Lines • ▼ Show 20 Lines
	; sp[0] = byval			; sp[0] = byval
	; sp[1] = ??			; sp[1] = ??
	; sp[2] = stack passed workitem ID x			; sp[2] = stack passed workitem ID x

	; GCN-LABEL: {{^}}kern_call_too_many_args_use_workitem_id_x_byval:			; GCN-LABEL: {{^}}kern_call_too_many_args_use_workitem_id_x_byval:
	; GCN: enable_vgpr_workitem_id = 0			; GCN: enable_vgpr_workitem_id = 0
	; GCN-DAG: s_mov_b32 s33, s7			; GCN-DAG: s_mov_b32 s33, s7
	; GCN-DAG: v_mov_b32_e32 [[K:v[0-9]+]], 0x3e7{{$}}			; GCN-DAG: v_mov_b32_e32 [[K:v[0-9]+]], 0x3e7{{$}}
				; GCN: s_add_u32 s32, s33, 0x400{{$}}
	; GCN: buffer_store_dword [[K]], off, s[0:3], s33 offset:4			; GCN: buffer_store_dword [[K]], off, s[0:3], s33 offset:4
				; GCN: buffer_store_dword v0, off, s[0:3], s32 offset:4
	; GCN: buffer_load_dword [[RELOAD_BYVAL:v[0-9]+]], off, s[0:3], s33 offset:4			; GCN: buffer_load_dword [[RELOAD_BYVAL:v[0-9]+]], off, s[0:3], s33 offset:4
	; GCN: s_add_u32 s32, s33, 0x400{{$}}

	; GCN-NOT: s32			; GCN-NOT: s32
	; GCN: buffer_store_dword v0, off, s[0:3], s32 offset:4

	; GCN: buffer_store_dword [[RELOAD_BYVAL]], off, s[0:3], s32{{$}}			; GCN: buffer_store_dword [[RELOAD_BYVAL]], off, s[0:3], s32{{$}}
	; GCN: v_mov_b32_e32 [[RELOAD_BYVAL]],			; GCN: v_mov_b32_e32 [[RELOAD_BYVAL]],
	; GCN: s_swappc_b64			; GCN: s_swappc_b64
	define amdgpu_kernel void @kern_call_too_many_args_use_workitem_id_x_byval() #1 {			define amdgpu_kernel void @kern_call_too_many_args_use_workitem_id_x_byval() #1 {
	%alloca = alloca i32, align 4, addrspace(5)			%alloca = alloca i32, align 4, addrspace(5)
	store volatile i32 999, i32 addrspace(5)* %alloca			store volatile i32 999, i32 addrspace(5)* %alloca
	call void @too_many_args_use_workitem_id_x_byval(			call void @too_many_args_use_workitem_id_x_byval(
	i32 10, i32 20, i32 30, i32 40,			i32 10, i32 20, i32 30, i32 40,
	i32 50, i32 60, i32 70, i32 80,			i32 50, i32 60, i32 70, i32 80,
	i32 90, i32 100, i32 110, i32 120,			i32 90, i32 100, i32 110, i32 120,
	i32 130, i32 140, i32 150, i32 160,			i32 130, i32 140, i32 150, i32 160,
	i32 170, i32 180, i32 190, i32 200,			i32 170, i32 180, i32 190, i32 200,
	i32 210, i32 220, i32 230, i32 240,			i32 210, i32 220, i32 230, i32 240,
	i32 250, i32 260, i32 270, i32 280,			i32 250, i32 260, i32 270, i32 280,
	i32 290, i32 300, i32 310, i32 320,			i32 290, i32 300, i32 310, i32 320,
	i32 addrspace(5)* %alloca)			i32 addrspace(5)* %alloca)
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}func_call_too_many_args_use_workitem_id_x_byval:			; GCN-LABEL: {{^}}func_call_too_many_args_use_workitem_id_x_byval:
	; GCN: v_mov_b32_e32 [[K:v[0-9]+]], 0x3e7{{$}}			; GCN: v_mov_b32_e32 [[K:v[0-9]+]], 0x3e7{{$}}
	; GCN: buffer_store_dword [[K]], off, s[0:3], s34{{$}}			; GCN: buffer_store_dword [[K]], off, s[0:3], s34{{$}}
	; GCN: buffer_load_dword [[RELOAD_BYVAL:v[0-9]+]], off, s[0:3], s34{{$}}
	; GCN: buffer_store_dword v0, off, s[0:3], s32 offset:4			; GCN: buffer_store_dword v0, off, s[0:3], s32 offset:4
				; GCN: buffer_load_dword [[RELOAD_BYVAL:v[0-9]+]], off, s[0:3], s34{{$}}
	; GCN: buffer_store_dword [[RELOAD_BYVAL]], off, s[0:3], s32{{$}}			; GCN: buffer_store_dword [[RELOAD_BYVAL]], off, s[0:3], s32{{$}}
	; GCN: v_mov_b32_e32 [[RELOAD_BYVAL]],			; GCN: v_mov_b32_e32 [[RELOAD_BYVAL]],
	; GCN: s_swappc_b64			; GCN: s_swappc_b64
	define void @func_call_too_many_args_use_workitem_id_x_byval() #1 {			define void @func_call_too_many_args_use_workitem_id_x_byval() #1 {
	%alloca = alloca i32, align 4, addrspace(5)			%alloca = alloca i32, align 4, addrspace(5)
	store volatile i32 999, i32 addrspace(5)* %alloca			store volatile i32 999, i32 addrspace(5)* %alloca
	call void @too_many_args_use_workitem_id_x_byval(			call void @too_many_args_use_workitem_id_x_byval(
	i32 10, i32 20, i32 30, i32 40,			i32 10, i32 20, i32 30, i32 40,
	▲ Show 20 Lines • Show All 207 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/captured-frame-index.ll

Show First 20 Lines • Show All 45 Lines • ▼ Show 20 Lines	define amdgpu_kernel void @stored_fi_to_lds_2_small_objects(float addrspace(5)* addrspace(3)* %ptr) #0 {
store volatile float addrspace(5)* %tmp0, float addrspace(5)* addrspace(3)* %ptr		store volatile float addrspace(5)* %tmp0, float addrspace(5)* addrspace(3)* %ptr
store volatile float addrspace(5)* %tmp1, float addrspace(5)* addrspace(3)* %ptr		store volatile float addrspace(5)* %tmp1, float addrspace(5)* addrspace(3)* %ptr
ret void		ret void
}		}

; Same frame index is used multiple times in the store		; Same frame index is used multiple times in the store
; GCN-LABEL: {{^}}stored_fi_to_self:		; GCN-LABEL: {{^}}stored_fi_to_self:
; GCN-DAG: v_mov_b32_e32 [[K:v[0-9]+]], 0x4d2{{$}}		; GCN-DAG: v_mov_b32_e32 [[K:v[0-9]+]], 0x4d2{{$}}
; GCN: buffer_store_dword [[K]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}
; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 4{{$}}		; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 4{{$}}
		; GCN: buffer_store_dword [[K]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}
; GCN: buffer_store_dword [[ZERO]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}		; GCN: buffer_store_dword [[ZERO]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}
define amdgpu_kernel void @stored_fi_to_self() #0 {		define amdgpu_kernel void @stored_fi_to_self() #0 {
%tmp = alloca i32 addrspace(5)*, addrspace(5)		%tmp = alloca i32 addrspace(5)*, addrspace(5)

; Avoid optimizing everything out		; Avoid optimizing everything out
store volatile i32 addrspace(5)* inttoptr (i32 1234 to i32 addrspace(5)), i32 addrspace(5) addrspace(5)* %tmp		store volatile i32 addrspace(5)* inttoptr (i32 1234 to i32 addrspace(5)), i32 addrspace(5) addrspace(5)* %tmp
%bitcast = bitcast i32 addrspace(5)* addrspace(5)* %tmp to i32 addrspace(5)*		%bitcast = bitcast i32 addrspace(5)* addrspace(5)* %tmp to i32 addrspace(5)*
store volatile i32 addrspace(5)* %bitcast, i32 addrspace(5)* addrspace(5)* %tmp		store volatile i32 addrspace(5)* %bitcast, i32 addrspace(5)* addrspace(5)* %tmp
ret void		ret void
}		}

; GCN-LABEL: {{^}}stored_fi_to_self_offset:		; GCN-LABEL: {{^}}stored_fi_to_self_offset:
; GCN-DAG: v_mov_b32_e32 [[K0:v[0-9]+]], 32{{$}}		; GCN-DAG: v_mov_b32_e32 [[K0:v[0-9]+]], 32{{$}}
; GCN: buffer_store_dword [[K0]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}

; GCN-DAG: v_mov_b32_e32 [[K1:v[0-9]+]], 0x4d2{{$}}		; GCN-DAG: v_mov_b32_e32 [[K1:v[0-9]+]], 0x4d2{{$}}

		; GCN: buffer_store_dword [[K0]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4{{$}}
		rampitec: Same here. It seems to help store clustering.
; GCN: buffer_store_dword [[K1]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:2052{{$}}		; GCN: buffer_store_dword [[K1]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:2052{{$}}

; GCN: v_mov_b32_e32 [[OFFSETK:v[0-9]+]], 0x804{{$}}		; GCN: v_mov_b32_e32 [[OFFSETK:v[0-9]+]], 0x804{{$}}
; GCN: buffer_store_dword [[OFFSETK]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:2052{{$}}		; GCN: buffer_store_dword [[OFFSETK]], off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:2052{{$}}
define amdgpu_kernel void @stored_fi_to_self_offset() #0 {		define amdgpu_kernel void @stored_fi_to_self_offset() #0 {
%tmp0 = alloca [512 x i32], addrspace(5)		%tmp0 = alloca [512 x i32], addrspace(5)
%tmp1 = alloca i32 addrspace(5)*, addrspace(5)		%tmp1 = alloca i32 addrspace(5)*, addrspace(5)


llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll

bb:
%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0		%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0
ret <2 x i16> %result		ret <2 x i16> %result
}		}

define <2 x i16> @chain_hi_to_lo_group_may_alias_store(i16 addrspace(3)* %ptr, i16 addrspace(3)* %may.alias) {		define <2 x i16> @chain_hi_to_lo_group_may_alias_store(i16 addrspace(3)* %ptr, i16 addrspace(3)* %may.alias) {
; GCN-LABEL: chain_hi_to_lo_group_may_alias_store:		; GCN-LABEL: chain_hi_to_lo_group_may_alias_store:
; GCN: ; %bb.0: ; %bb		; GCN: ; %bb.0: ; %bb
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v3, 0x7b		; GCN-NEXT: v_mov_b32_e32 v2, 0x7b
; GCN-NEXT: ds_read_u16 v2, v0		; GCN-NEXT: ds_read_u16 v3, v0
; GCN-NEXT: ds_write_b16 v1, v3		; GCN-NEXT: ds_write_b16 v1, v2
; GCN-NEXT: ds_read_u16 v0, v0 offset:2		; GCN-NEXT: ds_read_u16 v0, v0 offset:2
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; GCN-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_and_b32_e32 v0, 0xffff, v0		; GCN-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GCN-NEXT: v_lshl_or_b32 v0, v2, 16, v0		; GCN-NEXT: v_lshl_or_b32 v0, v3, 16, v0
; GCN-NEXT: s_setpc_b64 s[30:31]		; GCN-NEXT: s_setpc_b64 s[30:31]
bb:		bb:
%gep_lo = getelementptr inbounds i16, i16 addrspace(3)* %ptr, i64 1		%gep_lo = getelementptr inbounds i16, i16 addrspace(3)* %ptr, i64 1
%gep_hi = getelementptr inbounds i16, i16 addrspace(3)* %ptr, i64 0		%gep_hi = getelementptr inbounds i16, i16 addrspace(3)* %ptr, i64 0
%load_hi = load i16, i16 addrspace(3)* %gep_hi		%load_hi = load i16, i16 addrspace(3)* %gep_hi
store i16 123, i16 addrspace(3)* %may.alias		store i16 123, i16 addrspace(3)* %may.alias
%load_lo = load i16, i16 addrspace(3)* %gep_lo		%load_lo = load i16, i16 addrspace(3)* %gep_lo

%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1		%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1
%result = insertelement <2 x i16> %to.hi, i16 %load_lo, i32 0		%result = insertelement <2 x i16> %to.hi, i16 %load_lo, i32 0
ret <2 x i16> %result		ret <2 x i16> %result
}		}

llvm/test/CodeGen/AMDGPU/extract_vector_elt-i8.ll

define amdgpu_kernel void @extract_vector_elt_v16i8(i8 addrspace(1)* %out, <16 x i8> %foo) #0 {
ret void		ret void
}		}

; GCN-LABEL: {{^}}extract_vector_elt_v32i8:		; GCN-LABEL: {{^}}extract_vector_elt_v32i8:
; GCN-NOT: {{s|flat|buffer|global}}_load		; GCN-NOT: {{s|flat|buffer|global}}_load
; GCN: s_load_dword [[VAL:s[0-9]+]]		; GCN: s_load_dword [[VAL:s[0-9]+]]
; GCN-NOT: {{s|flat|buffer|global}}_load		; GCN-NOT: {{s|flat|buffer|global}}_load
; GCN: s_lshr_b32 [[ELT2:s[0-9]+]], [[VAL]], 16		; GCN: s_lshr_b32 [[ELT2:s[0-9]+]], [[VAL]], 16
; GCN-DAG: v_mov_b32_e32 [[V_LOAD0:v[0-9]+]], s{{[0-9]+}}
; GCN-DAG: v_mov_b32_e32 [[V_ELT2:v[0-9]+]], [[ELT2]]		; GCN-DAG: v_mov_b32_e32 [[V_ELT2:v[0-9]+]], [[ELT2]]
		; GCN-DAG: v_mov_b32_e32 [[V_LOAD0:v[0-9]+]], [[VAL]]
; GCN: buffer_store_byte [[V_ELT2]]		; GCN: buffer_store_byte [[V_ELT2]]
; GCN: buffer_store_byte [[V_LOAD0]]		; GCN: buffer_store_byte [[V_LOAD0]]
define amdgpu_kernel void @extract_vector_elt_v32i8(<32 x i8> %foo) #0 {		define amdgpu_kernel void @extract_vector_elt_v32i8(<32 x i8> %foo) #0 {
%p0 = extractelement <32 x i8> %foo, i32 0		%p0 = extractelement <32 x i8> %foo, i32 0
%p1 = extractelement <32 x i8> %foo, i32 2		%p1 = extractelement <32 x i8> %foo, i32 2
store volatile i8 %p1, i8 addrspace(1)* null		store volatile i8 %p1, i8 addrspace(1)* null
store volatile i8 %p0, i8 addrspace(1)* null		store volatile i8 %p0, i8 addrspace(1)* null
ret void		ret void

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll

; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX7-ALIGNED-NEXT: v_lshlrev_b32_e32 v1, 16, v1		; GFX7-ALIGNED-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX7-ALIGNED-NEXT: v_or_b32_e32 v0, v0, v1		; GFX7-ALIGNED-NEXT: v_or_b32_e32 v0, v0, v1
; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX7-UNALIGNED-LABEL: global_load_2xi16_align2:		; GFX7-UNALIGNED-LABEL: global_load_2xi16_align2:
; GFX7-UNALIGNED: ; %bb.0:		; GFX7-UNALIGNED: ; %bb.0:
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: v_add_i32_e32 v2, vcc, 2, v0		; GFX7-UNALIGNED-NEXT: flat_load_dword v0, v[0:1]
; GFX7-UNALIGNED-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
; GFX7-UNALIGNED-NEXT: flat_load_ushort v0, v[0:1]
; GFX7-UNALIGNED-NEXT: flat_load_ushort v1, v[2:3]
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: v_lshlrev_b32_e32 v1, 16, v1
; GFX7-UNALIGNED-NEXT: v_or_b32_e32 v0, v0, v1
; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX9-LABEL: global_load_2xi16_align2:		; GFX9-LABEL: global_load_2xi16_align2:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: global_load_ushort v2, v[0:1], off		; GFX9-NEXT: global_load_dword v0, v[0:1], off
; GFX9-NEXT: global_load_ushort v0, v[0:1], off offset:2		; GFX9-NEXT: v_mov_b32_e32 v1, 0xffff
		; GFX9-NEXT: s_mov_b32 s4, 0xffff
; GFX9-NEXT: s_waitcnt vmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_lshl_or_b32 v0, v0, 16, v2		; GFX9-NEXT: v_bfi_b32 v1, v1, 0, v0
		; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
; GFX9-NEXT: s_setpc_b64 s[30:31]		; GFX9-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1
%p.0 = load i16, i16 addrspace(1)* %p, align 2		%p.0 = load i16, i16 addrspace(1)* %p, align 2
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1
ret i32 %or		ret i32 %or
}		}

; Should not merge this to a dword store		; Should not merge this to a dword store
define amdgpu_kernel void @global_store_2xi16_align2(i16 addrspace(1)* %p, i16 addrspace(1)* %r) #0 {		define amdgpu_kernel void @global_store_2xi16_align2(i16 addrspace(1)* %p, i16 addrspace(1)* %r) #0 {
; GFX7-ALIGNED-LABEL: global_store_2xi16_align2:		; GFX7-ALIGNED-LABEL: global_store_2xi16_align2:
; GFX7-ALIGNED: ; %bb.0:		; GFX7-ALIGNED: ; %bb.0:
; GFX7-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2		; GFX7-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, 1		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v4, 1
		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v5, 2
; GFX7-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s0
; GFX7-ALIGNED-NEXT: s_add_u32 s2, s0, 2		; GFX7-ALIGNED-NEXT: s_add_u32 s2, s0, 2
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s1		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s0
; GFX7-ALIGNED-NEXT: flat_store_short v[0:1], v2
; GFX7-ALIGNED-NEXT: s_addc_u32 s3, s1, 0		; GFX7-ALIGNED-NEXT: s_addc_u32 s3, s1, 0
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s2		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, s2
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, 2		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s1
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s3		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v3, s3
; GFX7-ALIGNED-NEXT: flat_store_short v[0:1], v2		; GFX7-ALIGNED-NEXT: flat_store_short v[0:1], v4
		rampitec: Stores are clustered again, here and below, which is a nice improvement.
		; GFX7-ALIGNED-NEXT: flat_store_short v[2:3], v5
; GFX7-ALIGNED-NEXT: s_endpgm		; GFX7-ALIGNED-NEXT: s_endpgm
;		;
; GFX7-UNALIGNED-LABEL: global_store_2xi16_align2:		; GFX7-UNALIGNED-LABEL: global_store_2xi16_align2:
; GFX7-UNALIGNED: ; %bb.0:		; GFX7-UNALIGNED: ; %bb.0:
; GFX7-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2		; GFX7-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v2, 1		; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0x20001
; GFX7-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, s0		; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, s0
; GFX7-UNALIGNED-NEXT: s_add_u32 s2, s0, 2
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v1, s1		; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v1, s1
; GFX7-UNALIGNED-NEXT: flat_store_short v[0:1], v2		; GFX7-UNALIGNED-NEXT: flat_store_dword v[0:1], v2
; GFX7-UNALIGNED-NEXT: s_addc_u32 s3, s1, 0
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, s2
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v2, 2
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v1, s3
; GFX7-UNALIGNED-NEXT: flat_store_short v[0:1], v2
; GFX7-UNALIGNED-NEXT: s_endpgm		; GFX7-UNALIGNED-NEXT: s_endpgm
;		;
; GFX9-LABEL: global_store_2xi16_align2:		; GFX9-LABEL: global_store_2xi16_align2:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x8		; GFX9-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x8
; GFX9-NEXT: v_mov_b32_e32 v2, 1		; GFX9-NEXT: v_mov_b32_e32 v2, 0x20001
; GFX9-NEXT: v_mov_b32_e32 v3, 2
; GFX9-NEXT: s_waitcnt lgkmcnt(0)		; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: v_mov_b32_e32 v0, s0		; GFX9-NEXT: v_mov_b32_e32 v0, s0
; GFX9-NEXT: v_mov_b32_e32 v1, s1		; GFX9-NEXT: v_mov_b32_e32 v1, s1
; GFX9-NEXT: global_store_short v[0:1], v2, off		; GFX9-NEXT: global_store_dword v[0:1], v2, off
; GFX9-NEXT: global_store_short v[0:1], v3, off offset:2
; GFX9-NEXT: s_endpgm		; GFX9-NEXT: s_endpgm
%gep.r = getelementptr i16, i16 addrspace(1)* %r, i64 1		%gep.r = getelementptr i16, i16 addrspace(1)* %r, i64 1
store i16 1, i16 addrspace(1)* %r, align 2		store i16 1, i16 addrspace(1)* %r, align 2
store i16 2, i16 addrspace(1)* %gep.r, align 2		store i16 2, i16 addrspace(1)* %gep.r, align 2
ret void		ret void
}		}

; Should produce align 1 dword when legal		; Should produce align 1 dword when legal
; GFX9-NEXT: s_setpc_b64 s[30:31]
ret i32 %or		ret i32 %or
}		}

; Should produce align 1 dword when legal		; Should produce align 1 dword when legal
define amdgpu_kernel void @global_store_2xi16_align1(i16 addrspace(1)* %p, i16 addrspace(1)* %r) #0 {		define amdgpu_kernel void @global_store_2xi16_align1(i16 addrspace(1)* %p, i16 addrspace(1)* %r) #0 {
; GFX7-ALIGNED-LABEL: global_store_2xi16_align1:		; GFX7-ALIGNED-LABEL: global_store_2xi16_align1:
; GFX7-ALIGNED: ; %bb.0:		; GFX7-ALIGNED: ; %bb.0:
; GFX7-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2		; GFX7-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v4, 1		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v8, 1
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v5, 0		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v9, 0
		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v10, 2
; GFX7-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-ALIGNED-NEXT: s_add_u32 s2, s0, 2		; GFX7-ALIGNED-NEXT: s_add_u32 s2, s0, 2
; GFX7-ALIGNED-NEXT: s_addc_u32 s3, s1, 0		; GFX7-ALIGNED-NEXT: s_addc_u32 s3, s1, 0
; GFX7-ALIGNED-NEXT: s_add_u32 s4, s0, 1		; GFX7-ALIGNED-NEXT: s_add_u32 s4, s0, 1
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s0		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s0
; GFX7-ALIGNED-NEXT: s_addc_u32 s5, s1, 0		; GFX7-ALIGNED-NEXT: s_addc_u32 s5, s1, 0
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s1		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s1
; GFX7-ALIGNED-NEXT: s_add_u32 s0, s0, 3		; GFX7-ALIGNED-NEXT: s_add_u32 s0, s0, 3
		; GFX7-ALIGNED-NEXT: s_addc_u32 s1, s1, 0
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, s4		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, s4
		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v5, s1
		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v7, s3
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v3, s5		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v3, s5
; GFX7-ALIGNED-NEXT: flat_store_byte v[0:1], v4		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v4, s0
; GFX7-ALIGNED-NEXT: flat_store_byte v[2:3], v5		; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v6, s2
; GFX7-ALIGNED-NEXT: s_addc_u32 s1, s1, 0		; GFX7-ALIGNED-NEXT: flat_store_byte v[0:1], v8
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, s0		; GFX7-ALIGNED-NEXT: flat_store_byte v[2:3], v9
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v2, s2		; GFX7-ALIGNED-NEXT: flat_store_byte v[4:5], v9
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v1, s1		; GFX7-ALIGNED-NEXT: flat_store_byte v[6:7], v10
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v4, 2
; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v3, s3
; GFX7-ALIGNED-NEXT: flat_store_byte v[0:1], v5
; GFX7-ALIGNED-NEXT: flat_store_byte v[2:3], v4
; GFX7-ALIGNED-NEXT: s_endpgm		; GFX7-ALIGNED-NEXT: s_endpgm
;		;
; GFX7-UNALIGNED-LABEL: global_store_2xi16_align1:		; GFX7-UNALIGNED-LABEL: global_store_2xi16_align1:
; GFX7-UNALIGNED: ; %bb.0:		; GFX7-UNALIGNED: ; %bb.0:
; GFX7-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2		; GFX7-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0x20001		; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v2, 0x20001
; GFX7-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, s0		; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, s0

llvm/test/CodeGen/AMDGPU/global-saddr.ll

entry:
%add5 = add i64 %add2, %add1		%add5 = add i64 %add2, %add1
%add6 = add i64 %add4, %add3		%add6 = add i64 %add4, %add3
%add7 = add i64 %add6, %add5		%add7 = add i64 %add6, %add5
%gep9 = getelementptr i64, i64 addrspace(1)* %dst_image, i64 %idx		%gep9 = getelementptr i64, i64 addrspace(1)* %dst_image, i64 %idx
%ptr9 = getelementptr inbounds i64, i64 addrspace(1)* %gep9, i64 1		%ptr9 = getelementptr inbounds i64, i64 addrspace(1)* %gep9, i64 1
store volatile i64 %add7, i64 addrspace(1)* %ptr9		store volatile i64 %add7, i64 addrspace(1)* %ptr9

; Test various offset boundaries.		; Test various offset boundaries.
; GFX9: global_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:4088{{$}}
; GFX9: global_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:2040{{$}}		; GFX9: global_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:2040{{$}}
		; GFX9: global_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, off offset:4088{{$}}
; GFX9: global_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:4088{{$}}		; GFX9: global_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:4088{{$}}
%gep11 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 511		%gep11 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 511
%load11 = load i64, i64 addrspace(1)* %gep11		%load11 = load i64, i64 addrspace(1)* %gep11
%gep12 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 1023		%gep12 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 1023
%load12 = load i64, i64 addrspace(1)* %gep12		%load12 = load i64, i64 addrspace(1)* %gep12
%gep13 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 255		%gep13 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 255
%load13 = load i64, i64 addrspace(1)* %gep13		%load13 = load i64, i64 addrspace(1)* %gep13
%add11 = add i64 %load11, %load12		%add11 = add i64 %load11, %load12

llvm/test/CodeGen/AMDGPU/sched-assert-onlydbg-value-empty-region.mir

body: |
; CHECK: [[DEF1:%[0-9]+]]:vreg_64 = IMPLICIT_DEF		; CHECK: [[DEF1:%[0-9]+]]:vreg_64 = IMPLICIT_DEF
; CHECK: [[DEF2:%[0-9]+]]:vreg_64 = IMPLICIT_DEF		; CHECK: [[DEF2:%[0-9]+]]:vreg_64 = IMPLICIT_DEF
; CHECK: [[DEF3:%[0-9]+]]:vreg_64 = IMPLICIT_DEF		; CHECK: [[DEF3:%[0-9]+]]:vreg_64 = IMPLICIT_DEF
; CHECK: undef %11.sub1:vreg_64 = IMPLICIT_DEF		; CHECK: undef %11.sub1:vreg_64 = IMPLICIT_DEF
; CHECK: [[DEF4:%[0-9]+]]:vreg_64 = IMPLICIT_DEF		; CHECK: [[DEF4:%[0-9]+]]:vreg_64 = IMPLICIT_DEF
; CHECK: [[DEF5:%[0-9]+]]:vreg_64 = IMPLICIT_DEF		; CHECK: [[DEF5:%[0-9]+]]:vreg_64 = IMPLICIT_DEF
; CHECK: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec		; CHECK: [[V_MOV_B32_e32_:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
; CHECK: [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec		; CHECK: [[V_MOV_B32_e32_1:%[0-9]+]]:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
; CHECK: [[DEF6:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; CHECK: [[DEF7:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; CHECK: [[COPY1:%[0-9]+]]:vreg_64 = COPY [[GLOBAL_LOAD_DWORDX2_]]		; CHECK: [[COPY1:%[0-9]+]]:vreg_64 = COPY [[GLOBAL_LOAD_DWORDX2_]]
; CHECK: undef %6.sub0:vreg_64 = V_ADD_F32_e32 [[DEF]].sub0, [[COPY1]].sub0, implicit $exec		; CHECK: undef %6.sub0:vreg_64 = V_ADD_F32_e32 [[DEF]].sub0, [[COPY1]].sub0, implicit $exec
; CHECK: dead undef %6.sub1:vreg_64 = V_ADD_F32_e32 [[DEF]].sub1, [[COPY1]].sub0, implicit $exec		; CHECK: dead undef %6.sub1:vreg_64 = V_ADD_F32_e32 [[DEF]].sub1, [[COPY1]].sub0, implicit $exec
; CHECK: [[GLOBAL_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD [[COPY1]], 0, 0, 0, 0, implicit $exec		; CHECK: [[GLOBAL_LOAD_DWORD1:%[0-9]+]]:vgpr_32 = GLOBAL_LOAD_DWORD [[COPY1]], 0, 0, 0, 0, implicit $exec
		; CHECK: [[DEF6:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
		; CHECK: [[DEF7:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; CHECK: [[DEF8:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF		; CHECK: [[DEF8:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; CHECK: undef %19.sub0:vreg_64 = V_ADD_F32_e32 [[GLOBAL_LOAD_DWORD1]], [[GLOBAL_LOAD_DWORDX2_]].sub0, implicit $exec		; CHECK: undef %19.sub0:vreg_64 = V_ADD_F32_e32 [[GLOBAL_LOAD_DWORD1]], [[GLOBAL_LOAD_DWORDX2_]].sub0, implicit $exec
; CHECK: %19.sub1:vreg_64 = V_ADD_F32_e32 [[GLOBAL_LOAD_DWORD]], [[GLOBAL_LOAD_DWORD]], implicit $exec		; CHECK: %19.sub1:vreg_64 = V_ADD_F32_e32 [[GLOBAL_LOAD_DWORD]], [[GLOBAL_LOAD_DWORD]], implicit $exec
; CHECK: GLOBAL_STORE_DWORDX2 %19, %4, 32, 0, 0, 0, implicit $exec		; CHECK: GLOBAL_STORE_DWORDX2 %19, %4, 32, 0, 0, 0, implicit $exec
; CHECK: %11.sub0:vreg_64 = GLOBAL_LOAD_DWORD [[DEF1]], 0, 0, 0, 0, implicit $exec		; CHECK: %11.sub0:vreg_64 = GLOBAL_LOAD_DWORD [[DEF1]], 0, 0, 0, 0, implicit $exec
; CHECK: [[DEF2]].sub0:vreg_64 = GLOBAL_LOAD_DWORD [[DEF3]], 0, 0, 0, 0, implicit $exec		; CHECK: [[DEF2]].sub0:vreg_64 = GLOBAL_LOAD_DWORD [[DEF3]], 0, 0, 0, 0, implicit $exec
; CHECK: dead %20:vgpr_32 = GLOBAL_LOAD_DWORD %11, 0, 0, 0, 0, implicit $exec		; CHECK: dead %20:vgpr_32 = GLOBAL_LOAD_DWORD %11, 0, 0, 0, 0, implicit $exec
; CHECK: dead %21:vgpr_32 = GLOBAL_LOAD_DWORD [[DEF4]], 0, 0, 0, 0, implicit $exec		; CHECK: dead %21:vgpr_32 = GLOBAL_LOAD_DWORD [[DEF4]], 0, 0, 0, 0, implicit $exec

llvm/test/CodeGen/AMDGPU/sign_extend.ll

	define amdgpu_kernel void @s_sext_v4i8_to_v4i32(i32 addrspace(1)* %out, i32 %a) nounwind {			define amdgpu_kernel void @s_sext_v4i8_to_v4i32(i32 addrspace(1)* %out, i32 %a) nounwind {
	; SI-LABEL: s_sext_v4i8_to_v4i32:			; SI-LABEL: s_sext_v4i8_to_v4i32:
	; SI: ; %bb.0:			; SI: ; %bb.0:
	; SI-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0x9			; SI-NEXT: s_load_dwordx2 s[4:5], s[0:1], 0x9
	; SI-NEXT: s_load_dword s0, s[0:1], 0xb			; SI-NEXT: s_load_dword s0, s[0:1], 0xb
	; SI-NEXT: s_mov_b32 s7, 0xf000			; SI-NEXT: s_mov_b32 s7, 0xf000
	; SI-NEXT: s_mov_b32 s6, -1			; SI-NEXT: s_mov_b32 s6, -1
	; SI-NEXT: s_waitcnt lgkmcnt(0)			; SI-NEXT: s_waitcnt lgkmcnt(0)
				; SI-NEXT: s_bfe_i32 s3, s0, 0x80008
	; SI-NEXT: s_ashr_i32 s1, s0, 24			; SI-NEXT: s_ashr_i32 s1, s0, 24
	; SI-NEXT: s_bfe_i32 s2, s0, 0x80010			; SI-NEXT: s_bfe_i32 s2, s0, 0x80010
	; SI-NEXT: s_bfe_i32 s3, s0, 0x80008
	; SI-NEXT: s_sext_i32_i8 s0, s0			; SI-NEXT: s_sext_i32_i8 s0, s0
	; SI-NEXT: v_mov_b32_e32 v0, s0			; SI-NEXT: v_mov_b32_e32 v0, s0
				; SI-NEXT: v_mov_b32_e32 v1, s3
	; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0			; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0
	; SI-NEXT: s_waitcnt expcnt(0)			; SI-NEXT: buffer_store_dword v1, off, s[4:7], 0
	; SI-NEXT: v_mov_b32_e32 v0, s3			; SI-NEXT: s_waitcnt expcnt(1)
	; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0
	; SI-NEXT: s_waitcnt expcnt(0)
	; SI-NEXT: v_mov_b32_e32 v0, s2			; SI-NEXT: v_mov_b32_e32 v0, s2
	; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0			; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0
	; SI-NEXT: s_waitcnt expcnt(0)			; SI-NEXT: s_waitcnt expcnt(0)
	; SI-NEXT: v_mov_b32_e32 v0, s1			; SI-NEXT: v_mov_b32_e32 v0, s1
	; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0			; SI-NEXT: buffer_store_dword v0, off, s[4:7], 0
	; SI-NEXT: s_endpgm			; SI-NEXT: s_endpgm
	;			;
	; VI-LABEL: s_sext_v4i8_to_v4i32:			; VI-LABEL: s_sext_v4i8_to_v4i32:
	; SI-NEXT: s_mov_b32 s3, 0xf000			; SI-NEXT: s_mov_b32 s3, 0xf000
	; SI-NEXT: s_mov_b32 s2, -1			; SI-NEXT: s_mov_b32 s2, -1
	; SI-NEXT: s_waitcnt lgkmcnt(0)			; SI-NEXT: s_waitcnt lgkmcnt(0)
	; SI-NEXT: s_mov_b32 s0, s4			; SI-NEXT: s_mov_b32 s0, s4
	; SI-NEXT: s_mov_b32 s1, s5			; SI-NEXT: s_mov_b32 s1, s5
	; SI-NEXT: s_ashr_i64 s[4:5], s[6:7], 48			; SI-NEXT: s_ashr_i64 s[4:5], s[6:7], 48
	; SI-NEXT: s_ashr_i32 s5, s6, 16			; SI-NEXT: s_ashr_i32 s5, s6, 16
	; SI-NEXT: s_sext_i32_i16 s6, s6			; SI-NEXT: s_sext_i32_i16 s6, s6
	; SI-NEXT: v_mov_b32_e32 v0, s6
	; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; SI-NEXT: s_waitcnt expcnt(0)
	; SI-NEXT: v_mov_b32_e32 v0, s5
	; SI-NEXT: s_sext_i32_i16 s7, s7			; SI-NEXT: s_sext_i32_i16 s7, s7
				; SI-NEXT: v_mov_b32_e32 v0, s6
				; SI-NEXT: v_mov_b32_e32 v1, s5
	; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; SI-NEXT: s_waitcnt expcnt(0)			; SI-NEXT: buffer_store_dword v1, off, s[0:3], 0
				; SI-NEXT: s_waitcnt expcnt(1)
	; SI-NEXT: v_mov_b32_e32 v0, s7			; SI-NEXT: v_mov_b32_e32 v0, s7
	; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; SI-NEXT: s_waitcnt expcnt(0)			; SI-NEXT: s_waitcnt expcnt(0)
	; SI-NEXT: v_mov_b32_e32 v0, s4			; SI-NEXT: v_mov_b32_e32 v0, s4
	; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; SI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; SI-NEXT: s_endpgm			; SI-NEXT: s_endpgm
	;			;
	; VI-LABEL: s_sext_v4i16_to_v4i32:			; VI-LABEL: s_sext_v4i16_to_v4i32:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x24			; VI-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x24
	; VI-NEXT: s_mov_b32 s3, 0xf000			; VI-NEXT: s_mov_b32 s3, 0xf000
	; VI-NEXT: s_mov_b32 s2, -1			; VI-NEXT: s_mov_b32 s2, -1
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: s_mov_b32 s1, s5			; VI-NEXT: s_mov_b32 s1, s5
	; VI-NEXT: s_ashr_i32 s5, s6, 16			; VI-NEXT: s_ashr_i32 s5, s6, 16
	; VI-NEXT: s_sext_i32_i16 s6, s6			; VI-NEXT: s_sext_i32_i16 s6, s6
	; VI-NEXT: s_mov_b32 s0, s4			; VI-NEXT: s_mov_b32 s0, s4
	; VI-NEXT: v_mov_b32_e32 v0, s6
	; VI-NEXT: s_ashr_i32 s4, s7, 16			; VI-NEXT: s_ashr_i32 s4, s7, 16
	; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; VI-NEXT: v_mov_b32_e32 v0, s5
	; VI-NEXT: s_sext_i32_i16 s7, s7			; VI-NEXT: s_sext_i32_i16 s7, s7
				; VI-NEXT: v_mov_b32_e32 v0, s6
				; VI-NEXT: v_mov_b32_e32 v1, s5
	; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0
				; VI-NEXT: buffer_store_dword v1, off, s[0:3], 0
	; VI-NEXT: v_mov_b32_e32 v0, s7			; VI-NEXT: v_mov_b32_e32 v0, s7
	; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; VI-NEXT: v_mov_b32_e32 v0, s4			; VI-NEXT: v_mov_b32_e32 v0, s4
	; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0			; VI-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	%cast = bitcast i64 %a to <4 x i16>			%cast = bitcast i64 %a to <4 x i16>
	%ext = sext <4 x i16> %cast to <4 x i32>			%ext = sext <4 x i16> %cast to <4 x i32>
	%elt0 = extractelement <4 x i32> %ext, i32 0			%elt0 = extractelement <4 x i32> %ext, i32 0