Details

Reviewers

fhahn
evandro
arsenm
jsji
hfinkel
kbarton
rampitec
foad

Group Reviewers

Restricted Project

Commits

rG3359ea62edcc: [Scheduling] Create the missing dependency edges for store cluster

Summary

For a load cluster, we don't need to copy the predecessor edges (SUb->reg) from SUb to SUa, as both loads depend on the same base register "reg".

     +-------+
+---->  reg  |
|    +---+---+
|        ^
|        |
|        |
|        |
|    +---+---+
|    |  SUa  |  Load 0(reg)
|    +---+---+
|        ^
|        |
|        |
|    +---+---+
+----+  SUb  |  Load 4(reg)
     +-------+

But for a store cluster, we need to create the edge marked "Missing" below, so that the instructions the store SUb depends on are not scheduled in between SUa and SUb. Note that several AMDGPU test cases change with this patch. I examined them one by one and found no obvious issue from a scheduling perspective, but this needs double confirmation. One case shows some degradation in the final code sequence, which I think is an issue in a later pass.

     +-------+
+---->  reg  |
|    +---+---+
|        ^
|        |         Missing       +-------+
|        | +-------------------->+   y   |
|        | |                     +---+---+
|    +---+-+-+                       ^
|    |  SUa  |  Store x 0(reg)       |
|    +---+---+                       |
|        ^                           |
|        |  +------------------------+
|        |  |
|    +---+--++
+----+  SUb  |  Store y 4(reg)
     +-------+
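
The two diagrams can be tried out on a toy graph. The following is a minimal sketch, not the patch itself: `ToyDAG` and `clusterPair` are hypothetical names introduced for illustration, nodes are plain integers, and unlike the real scheduler DAG no cycle check is done when adding an edge.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-in for a scheduling DAG (NOT LLVM's ScheduleDAG API).
// Preds[N] lists the nodes that must be scheduled before node N.
struct ToyDAG {
  std::vector<std::vector<unsigned>> Preds;

  explicit ToyDAG(unsigned NumNodes) : Preds(NumNodes) {}

  bool hasEdge(unsigned To, unsigned From) const {
    const std::vector<unsigned> &P = Preds[To];
    return std::find(P.begin(), P.end(), From) != P.end();
  }

  // Record "To depends on From"; self edges and duplicates are ignored.
  // Unlike the real scheduler, no cycle check is performed here.
  void addEdge(unsigned To, unsigned From) {
    if (To != From && !hasEdge(To, From))
      Preds[To].push_back(From);
  }
};

// Hypothetical helper mirroring the load/store split described above.
void clusterPair(ToyDAG &DAG, unsigned SUa, unsigned SUb, bool IsLoad) {
  if (SUa > SUb)
    std::swap(SUa, SUb);
  DAG.addEdge(SUb, SUa); // cluster edge: SUb is scheduled after SUa

  if (IsLoad) {
    // Loads: every other user of SUa must also wait for SUb.
    for (unsigned N = 0; N < DAG.Preds.size(); ++N)
      if (N != SUb && DAG.hasEdge(N, SUa))
        DAG.addEdge(N, SUb);
  } else {
    // Stores: everything SUb waits for must also precede SUa.
    const std::vector<unsigned> PredsOfB = DAG.Preds[SUb];
    for (unsigned P : PredsOfB)
      if (P != SUa)
        DAG.addEdge(SUa, P);
  }
}
```

With nodes numbered 0 = reg, 1 = y, 2 = SUa and 3 = SUb as in the second diagram, calling `clusterPair(DAG, 2, 3, /*IsLoad=*/false)` adds the SUa->y edge that the diagram marks as missing, so y can no longer be placed between the two stores.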

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

steven.zhang created this revision. Dec 31 2019, 1:57 AM

Herald added a project: Restricted Project. Dec 31 2019, 1:57 AM

Herald added subscribers: javed.absar, hiraditya, tpr and 4 others.

steven.zhang edited the summary of this revision. Dec 31 2019, 1:58 AM

steven.zhang updated this revision to Diff 235692. Dec 31 2019, 2:02 AM

steven.zhang marked an inline comment as done.

steven.zhang added inline comments.

llvm/test/CodeGen/AMDGPU/byval-frame-setup.ll
339 ↗	(On Diff #235692)	This case shows some degradation, as it generates more instructions. But the output of the scheduler is as expected. Is this an issue in a later pass?

fhahn added inline comments. Dec 31 2019, 4:24 AM

llvm/test/CodeGen/AArch64/macro-fusion.ll
22 ↗	(On Diff #235692)	This should probably go into a different file, as it’s load/store clustering and not macro fusion

Address comments about the test case.

Ping ...

This adds new dependencies, hence I think it would be good to gather code size/perf numbers with this change for some impacted targets (e.g. AArch64) to be reasonably sure that there are no unexpected knock-on effects.

llvm/lib/CodeGen/MachineScheduler.cpp
1602	I may be missing something, but IIRC stores could have other memory operations as successors, e.g. to enforce an ordering between aliasing memory operations.

steven.zhang marked an inline comment as done. Jan 7 2020, 3:22 AM

steven.zhang added inline comments.

llvm/lib/CodeGen/MachineScheduler.cpp
1602	If they have a memory dependency, they won't be put into the same group.

fhahn added inline comments. Jan 7 2020, 3:39 AM

llvm/lib/CodeGen/MachineScheduler.cpp
1602	Ah right. The wording in the comment seems a bit general though; maybe it would be possible to mention that this is a constraint for the clustering. It might be good to add it as an assert as well :)

steven.zhang marked an inline comment as done. Jan 7 2020, 5:07 AM

steven.zhang added inline comments.

llvm/lib/CodeGen/MachineScheduler.cpp
1602	OK. I will update the comment and add the assertion if it is easy. Thank you for the comments!
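
The point made in this thread, that memory operations with a memory dependency between them never land in the same candidate group, can be illustrated with a small sketch. It assumes the grouping key is the operation's ordering predecessor; `ToyMemOp`, `groupByChainPred` and `mayCluster` are hypothetical names, and the actual grouping code is not part of the diff shown in this review.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Toy memory operation (hypothetical type, not an LLVM class).
struct ToyMemOp {
  unsigned NodeNum; // scheduling-unit number
  int ChainPred;    // node it must follow due to a memory dependency, -1 if none
};

// Bucket memory operations by the node they are ordered after.
std::map<int, std::vector<unsigned>>
groupByChainPred(const std::vector<ToyMemOp> &Ops) {
  std::map<int, std::vector<unsigned>> Groups;
  for (const ToyMemOp &Op : Ops)
    Groups[Op.ChainPred].push_back(Op.NodeNum);
  return Groups;
}

// Two operations are clustering candidates only if they share a bucket.
bool mayCluster(const std::vector<ToyMemOp> &Ops, unsigned A, unsigned B) {
  const std::map<int, std::vector<unsigned>> Groups = groupByChainPred(Ops);
  for (const auto &Entry : Groups) {
    const std::vector<unsigned> &Members = Entry.second;
    bool HasA = std::find(Members.begin(), Members.end(), A) != Members.end();
    bool HasB = std::find(Members.begin(), Members.end(), B) != Members.end();
    if (HasA && HasB)
      return true;
  }
  return false;
}
```

In this model, a store that is ordered after another store gets that store as its key, so the two end up in different buckets and are never offered to the clustering loop as a pair.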

In D72031#1807588, @fhahn wrote:

This adds new dependencies, hence I think it would be good to gather code size/perf numbers with this change for some impacted targets (e.g. AArch64) to be reasonably sure that there are no unexpected knock-on effects.

I can do the code size part with SPEC 2017 for AArch64 and other affected targets, but I cannot evaluate the perf numbers as I don't have the environment. Can someone help me get that data? Thank you in advance!

Herald added a subscriber: herhut. Jan 8 2020, 2:42 AM

Address the comments. All the test case changes disappeared once the patch https://reviews.llvm.org/D72706 landed, so I no longer see any negative impact from this patch. By design, it is doing the right thing, without increasing compile time.

Unit tests: pass. 61882 tests passed, 0 failed and 782 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster completed remote builds in B44030: Diff 238202. Jan 15 2020, 2:56 AM

Rebase the patch and update the test change for AMDGPU.

Herald added subscribers: kerbowa, nemanjai. Feb 17 2020, 5:54 PM

steven.zhang updated this revision to Diff 245060. Feb 17 2020, 5:57 PM

Herald added a subscriber: wuzish. Feb 17 2020, 5:57 PM

Harbormaster failed remote builds in B46670: Diff 245059! Feb 17 2020, 6:12 PM

The test changes look neutral to me now, and the logic seems plausible. How come only AMDGPU tests are affected?

In D72031#1880128, @rampitec wrote:

The test changes look neutral to me now, and the logic seems plausible. How come only AMDGPU tests are affected?

For now, only AMDGPU and AArch64 support memory-op clustering in LLVM. AArch64 only clusters back-to-back load/store pairs, while AMDGPU can cluster up to 6 loads/stores. So it is easier to expose this change on AMDGPU.
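
To make the effect of the per-target limit concrete, here is a minimal sketch of the `ClusterLength` bookkeeping from the loop in MachineScheduler.cpp, with the target hook reduced to a size check only. `clusterEdges` and `MaxClusterSize` are hypothetical names; real `shouldClusterMemOps` implementations also look at the base operands, and the limits 2 and 6 are taken from the comment above rather than from the target code.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Return the (Idx, Idx + 1) pairs that would receive a cluster edge when
// NumOps adjacent memory operations are offered to a target that accepts
// at most MaxClusterSize operations per cluster.
std::vector<std::pair<unsigned, unsigned>>
clusterEdges(unsigned NumOps, unsigned MaxClusterSize) {
  std::vector<std::pair<unsigned, unsigned>> Edges;
  unsigned ClusterLength = 1;
  for (unsigned Idx = 0; Idx + 1 < NumOps; ++Idx) {
    // Stand-in for TII->shouldClusterMemOps(..., ClusterLength + 1).
    if (ClusterLength + 1 <= MaxClusterSize) {
      Edges.emplace_back(Idx, Idx + 1);
      ++ClusterLength;
    } else {
      ClusterLength = 1; // start a new cluster at the next operation
    }
  }
  return Edges;
}
```

For six adjacent operations, a limit of 2 yields three separate pairs, while a limit of 6 yields a single chain of five cluster edges. Under these assumptions, longer chains mean more places where the copied predecessor edges can change the schedule, which is consistent with the change showing up mainly in AMDGPU tests.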

LGTM, but please wait for other responses.

Would it be possible to add a test case for AArch64?

qiucf mentioned this in D74524: [Scheduling] Improve memory ops cluster preparation. Feb 19 2020, 6:49 AM

In D72031#1880821, @evandro wrote:

Would it be possible to add a test case for AArch64?

That would not be easy, but I could craft a case for AArch64 to show that the internal dependency graph has changed.

Add one test for AArch64 and rebase the patch.

Harbormaster completed remote builds in B47108: Diff 246134. Feb 23 2020, 7:35 PM

Gentle ping

Ping ...

Rebase the patch.

Harbormaster completed remote builds in B64863: Diff 279117. Jul 19 2020, 7:37 PM

Seems reasonable to me. It's just a heuristic and shouldn't affect correctness, right?

In D72031#2162539, @foad wrote:

Seems reasonable to me. It's just a heuristic and shouldn't affect correctness, right?

Yes.

The AMDGPU test change likely means nothing, but it'd be good if someone who maintains or works on that target would OK it. I suggest giving it another week or so. Otherwise, it LGTM.

arsenm accepted this revision. Jul 21 2020, 3:50 PM

This revision is now accepted and ready to land. Jul 21 2020, 3:50 PM

I see three extra AMDGPU tests changed when rebasing the patch. @evandro, would you please help me verify that there is no degradation in those changes? Thank you.

LLVM :: CodeGen/AMDGPU/call-argument-types.ll
LLVM :: CodeGen/AMDGPU/callee-special-input-vgprs.ll
LLVM :: CodeGen/AMDGPU/stack-realign.ll

Harbormaster completed remote builds in B66326: Diff 281808. Jul 30 2020, 1:43 AM

I'm not familiar with the AMDGPU target, but the changes in these new tests seem harmless, except in CodeGen/AMDGPU/stack-realign.ll, where an instruction disappeared. Again, I'd defer to someone who works in this backend, perhaps @arsenm, to chime in.

arsenm added inline comments. Jul 30 2020, 3:10 PM

llvm/test/CodeGen/AMDGPU/stack-realign.ll
164–171 ↗	(On Diff #281808)	Looks like an irrelevant change; no problem. I don't see anything missing?

evandro accepted this revision. Jul 30 2020, 4:39 PM

evandro added inline comments.

llvm/test/CodeGen/AMDGPU/stack-realign.ll
164–171 ↗	(On Diff #281808)	My bad.

Thank you for all the information!

steven.zhang updated this revision to Diff 282103. Jul 30 2020, 7:15 PM

Harbormaster completed remote builds in B66489: Diff 282103. Jul 30 2020, 7:51 PM

This revision was landed with ongoing or failed builds. Aug 6 2020, 9:58 PM

Closed by commit rG3359ea62edcc: [Scheduling] Create the missing dependency edges for store cluster (authored by steven.zhang).

This revision was automatically updated to reflect the committed changes.

steven.zhang added a commit: rG3359ea62edcc: [Scheduling] Create the missing dependency edges for store cluster.

Diff 246134

llvm/lib/CodeGen/MachineScheduler.cpp

Show First 20 Lines • Show All 1,586 Lines • ▼ Show 20 Lines	for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
     if (TII->shouldClusterMemOps(MemOpRecords[Idx].BaseOps,
                                  MemOpRecords[Idx + 1].BaseOps,
                                  ClusterLength + 1)) {
       if (SUa->NodeNum > SUb->NodeNum)
         std::swap(SUa, SUb);
       if (DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
         LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
                           << SUb->NodeNum << ")\n");
+        if (IsLoad) {
           // Copy successor edges from SUa to SUb. Interleaving computation
           // dependent on SUa can prevent load combining due to register reuse.
           // Predecessor edges do not need to be copied from SUb to SUa since
           // nearby loads should have effectively the same inputs.
           for (const SDep &Succ : SUa->Succs) {
             if (Succ.getSUnit() == SUb)
               continue;
-          LLVM_DEBUG(dbgs()
-                     << " Copy Succ SU(" << Succ.getSUnit()->NodeNum << ")\n");
+            LLVM_DEBUG(dbgs() << " Copy Succ SU(" << Succ.getSUnit()->NodeNum
+                              << ")\n");
             DAG->addEdge(Succ.getSUnit(), SDep(SUb, SDep::Artificial));
           }
+        } else {
+          // Copy predecessor edges from SUb to SUa to avoid the SUnits that
+          // SUb dependent on scheduled in-between SUb and SUa. Successor edges
+          // do not need to be copied from SUa to SUb since no one will depend
+          // on stores.
+          // Notice that, we don't need to care about the memory dependency as
+          // we won't try to cluster them if they have any memory dependency.
+          for (const SDep &Pred : SUb->Preds) {
+            if (Pred.getSUnit() == SUa)
+              continue;
+            LLVM_DEBUG(dbgs() << " Copy Pred SU(" << Pred.getSUnit()->NodeNum
+                              << ")\n");
+            DAG->addEdge(SUa, SDep(Pred.getSUnit(), SDep::Artificial));
+          }
+        }
         ++ClusterLength;
       } else
         ClusterLength = 1;
     } else
       ClusterLength = 1;
   }
 }

▲ Show 20 Lines • Show All 2,160 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll

Show First 20 Lines • Show All 141 Lines • ▼ Show 20 Lines	entry:
   store volatile i64 %v, i64* %arrayidx1
   %arrayidx2 = getelementptr inbounds i64, i64* %P, i64 1
   store volatile i64 %v, i64* %arrayidx2
   %arrayidx3 = getelementptr inbounds i64, i64* %P, i64 4
   store volatile i64 %v, i64* %arrayidx3
   ret i64 %v
 }

+; Verify that the SU(2) and SU(4) are the preds of SU(3)
+; CHECK: ******** MI Scheduling ********
+; CHECK-LABEL: stp_missing_preds_edges:%bb.0
+; CHECK:Cluster ld/st SU(3) - SU(5)
+; CHECK: Copy Pred SU(4)
+; CHECK: Copy Pred SU(2)
+; CHECK:SU(2): %0:gpr64common = COPY $x0
+; CHECK:SU(3): STRWui %1:gpr32, %0:gpr64common, 0
+; CHECK:SU(4): %3:gpr32common = nsw ADDWri %2:gpr32common, 5, 0
+; CHECK:SU(5): STRWui %3:gpr32common, %0:gpr64common, 1
+define void @stp_missing_preds_edges(i32* %p, i32 %m, i32 %n) {
+entry:
+  store i32 %m, i32* %p, align 4
+  %add = add nsw i32 %n, 5
+  %arrayidx1 = getelementptr inbounds i32, i32* %p, i64 1
+  store i32 %add, i32* %arrayidx1, align 4
+  ret void
+}

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll

	Show First 20 Lines • Show All 356 Lines • ▼ Show 20 Lines
 ; SI-NEXT: s_mov_b32 s3, s7
 ; SI-NEXT: v_lshlrev_b32_e32 v0, 3, v0
 ; SI-NEXT: v_mov_b32_e32 v1, 0
 ; SI-NEXT: s_waitcnt lgkmcnt(0)
 ; SI-NEXT: buffer_load_ubyte v2, v[0:1], s[0:3], 0 addr64
 ; SI-NEXT: buffer_load_ubyte v3, v[0:1], s[0:3], 0 addr64 offset:1
 ; SI-NEXT: buffer_load_ubyte v6, v[0:1], s[0:3], 0 addr64 offset:2
 ; SI-NEXT: buffer_load_ubyte v4, v[0:1], s[0:3], 0 addr64 offset:3
-; SI-NEXT: buffer_load_ubyte v7, v[0:1], s[0:3], 0 addr64 offset:4
-; SI-NEXT: buffer_load_ubyte v5, v[0:1], s[0:3], 0 addr64 offset:5
+; SI-NEXT: buffer_load_ubyte v5, v[0:1], s[0:3], 0 addr64 offset:4
+; SI-NEXT: buffer_load_ubyte v7, v[0:1], s[0:3], 0 addr64 offset:5
 ; SI-NEXT: buffer_load_ubyte v0, v[0:1], s[0:3], 0 addr64 offset:6
 ; SI-NEXT: s_mov_b32 s6, -1
 ; SI-NEXT: s_waitcnt vmcnt(5)
 ; SI-NEXT: v_cvt_f32_ubyte2_e32 v1, v3
 ; SI-NEXT: s_waitcnt vmcnt(3)
-; SI-NEXT: v_lshlrev_b32_e32 v8, 8, v4
+; SI-NEXT: v_lshlrev_b32_e32 v3, 8, v4
 ; SI-NEXT: s_waitcnt vmcnt(2)
-; SI-NEXT: v_cvt_f32_ubyte0_e32 v4, v7
+; SI-NEXT: v_cvt_f32_ubyte0_e32 v4, v5
 ; SI-NEXT: s_waitcnt vmcnt(1)
-; SI-NEXT: v_cvt_f32_ubyte2_e32 v5, v5
+; SI-NEXT: v_cvt_f32_ubyte2_e32 v5, v7
 ; SI-NEXT: s_waitcnt vmcnt(0)
 ; SI-NEXT: v_cvt_f32_ubyte0_e32 v7, v0
-; SI-NEXT: v_or_b32_e32 v0, v8, v6
+; SI-NEXT: v_or_b32_e32 v0, v3, v6
 ; SI-NEXT: v_lshlrev_b32_e32 v0, 16, v0
 ; SI-NEXT: v_or_b32_e32 v6, v0, v2
 ; SI-NEXT: v_cvt_f32_ubyte3_e32 v3, v0
 ; SI-NEXT: v_cvt_f32_ubyte2_e32 v2, v0
 ; SI-NEXT: v_cvt_f32_ubyte0_e32 v0, v6
 ; SI-NEXT: buffer_store_dword v7, off, s[4:7], 0 offset:24
 ; SI-NEXT: buffer_store_dwordx2 v[4:5], off, s[4:7], 0 offset:16
 ; SI-NEXT: buffer_store_dwordx4 v[0:3], off, s[4:7], 0
	Show All 9 Lines
 ; VI-NEXT: s_waitcnt lgkmcnt(0)
 ; VI-NEXT: v_mov_b32_e32 v1, s1
 ; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v0
 ; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
 ; VI-NEXT: v_add_u32_e32 v2, vcc, 3, v0
 ; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
 ; VI-NEXT: v_add_u32_e32 v4, vcc, 2, v0
 ; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc
-; VI-NEXT: flat_load_ubyte v10, v[2:3]
-; VI-NEXT: flat_load_ubyte v11, v[4:5]
-; VI-NEXT: v_add_u32_e32 v2, vcc, 6, v0
-; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
-; VI-NEXT: v_add_u32_e32 v4, vcc, 4, v0
-; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc
-; VI-NEXT: v_add_u32_e32 v6, vcc, 5, v0
+; VI-NEXT: v_add_u32_e32 v6, vcc, 6, v0
 ; VI-NEXT: v_addc_u32_e32 v7, vcc, 0, v1, vcc
-; VI-NEXT: v_add_u32_e32 v8, vcc, 1, v0
+; VI-NEXT: v_add_u32_e32 v8, vcc, 4, v0
 ; VI-NEXT: v_addc_u32_e32 v9, vcc, 0, v1, vcc
-; VI-NEXT: flat_load_ubyte v2, v[2:3]
-; VI-NEXT: flat_load_ubyte v3, v[4:5]
-; VI-NEXT: flat_load_ubyte v4, v[6:7]
-; VI-NEXT: flat_load_ubyte v5, v[8:9]
+; VI-NEXT: v_add_u32_e32 v10, vcc, 5, v0
+; VI-NEXT: v_addc_u32_e32 v11, vcc, 0, v1, vcc
+; VI-NEXT: v_add_u32_e32 v12, vcc, 1, v0
+; VI-NEXT: v_addc_u32_e32 v13, vcc, 0, v1, vcc
+; VI-NEXT: flat_load_ubyte v6, v[6:7]
+; VI-NEXT: flat_load_ubyte v7, v[8:9]
+; VI-NEXT: flat_load_ubyte v8, v[10:11]
+; VI-NEXT: flat_load_ubyte v9, v[12:13]
 ; VI-NEXT: flat_load_ubyte v0, v[0:1]
+; VI-NEXT: flat_load_ubyte v1, v[2:3]
+; VI-NEXT: flat_load_ubyte v2, v[4:5]
 ; VI-NEXT: s_waitcnt vmcnt(6) lgkmcnt(6)
-; VI-NEXT: v_lshlrev_b32_e32 v7, 8, v10
+; VI-NEXT: v_cvt_f32_ubyte0_e32 v6, v6
+; VI-NEXT: s_waitcnt vmcnt(5) lgkmcnt(5)
+; VI-NEXT: v_cvt_f32_ubyte0_e32 v4, v7
 ; VI-NEXT: s_waitcnt vmcnt(4) lgkmcnt(4)
-; VI-NEXT: v_cvt_f32_ubyte0_e32 v6, v2
-; VI-NEXT: v_or_b32_sdwa v2, v7, v11 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT: v_cvt_f32_ubyte2_e32 v5, v8
+; VI-NEXT: s_waitcnt vmcnt(2) lgkmcnt(2)
+; VI-NEXT: v_cvt_f32_ubyte0_e32 v0, v0
 ; VI-NEXT: s_waitcnt vmcnt(1) lgkmcnt(1)
-; VI-NEXT: v_cvt_f32_ubyte2_e32 v1, v5
-; VI-NEXT: v_cvt_f32_ubyte2_e32 v5, v4
-; VI-NEXT: v_cvt_f32_ubyte0_e32 v4, v3
-; VI-NEXT: v_cvt_f32_ubyte3_e32 v3, v2
+; VI-NEXT: v_lshlrev_b32_e32 v3, 8, v1
 ; VI-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; VI-NEXT: v_cvt_f32_ubyte0_e32 v0, v0
+; VI-NEXT: v_or_b32_sdwa v2, v3, v2 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
+; VI-NEXT: v_cvt_f32_ubyte3_e32 v3, v2
+; VI-NEXT: v_cvt_f32_ubyte2_e32 v1, v9
 ; VI-NEXT: v_cvt_f32_ubyte2_e32 v2, v2
 ; VI-NEXT: buffer_store_dwordx3 v[4:6], off, s[4:7], 0 offset:16
 ; VI-NEXT: buffer_store_dwordx4 v[0:3], off, s[4:7], 0
 ; VI-NEXT: s_endpgm
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep = getelementptr <7 x i8>, <7 x i8> addrspace(1)* %in, i32 %tid
   %load = load <7 x i8>, <7 x i8> addrspace(1)* %gep, align 1
   %cvt = uitofp <7 x i8> %load to <7 x float>
	▲ Show 20 Lines • Show All 492 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/half.ll

	Show First 20 Lines • Show All 306 Lines • ▼ Show 20 Lines
 ; GCN: flat_load_dwordx4
 ; GCN: flat_load_dwordx4

 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
+; SI: v_cvt_f32_f16_e32

 ; GCN: flat_store_dwordx4

 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
 ; SI: v_cvt_f32_f16_e32
-; SI: v_cvt_f32_f16_e32

 ; VI: v_cvt_f32_f16_e32
 ; VI: v_cvt_f32_f16_sdwa


 ; GCN: flat_store_dwordx4
 ; GCN: flat_store_dwordx4
 ; GCN: flat_store_dwordx4
	▲ Show 20 Lines • Show All 305 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/salu-to-valu.ll

	Show First 20 Lines • Show All 167 Lines • ▼ Show 20 Lines

 ; Original scalar load uses SGPR offset on SI and 32-bit literal on
 ; CI.

 ; GCN-LABEL: {{^}}smrd_valu_ci_offset_x8:
 ; GCN-NOHSA-DAG: s_mov_b32 [[OFFSET0:s[0-9]+]], 0x9a40{{$}}
 ; CI-NOHSA-DAG: s_mov_b32 [[OFFSET1:s[0-9]+]], 0x9a50{{$}}
 ; CI-NOHSA-NOT: v_add
-; SI: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+:[0-9]+}}], 0 addr64 offset:16
 ; CI-NOHSA: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+:[0-9]+}}], [[OFFSET1]] addr64{{$}}
 ; GCN-NOHSA: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+:[0-9]+}}], [[OFFSET0]] addr64{{$}}
+; SI: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+:[0-9]+}}], 0 addr64 offset:16

 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
 ; GCN-NOHSA: v_or_b32_e32 {{v[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}
	▲ Show 20 Lines • Show All 327 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/widen-smrd-loads.ll

	Show First 20 Lines • Show All 129 Lines • ▼ Show 20 Lines
 ; SI-NEXT: s_waitcnt lgkmcnt(0)
 ; SI-NEXT: s_load_dword s0, s[0:1], 0x0
 ; SI-NEXT: s_waitcnt lgkmcnt(0)
 ; SI-NEXT: s_add_i32 s0, s0, 34
 ; SI-NEXT: s_or_b32 s0, s0, 4
 ; SI-NEXT: v_mov_b32_e32 v0, s0
 ; SI-NEXT: s_bfe_u32 s0, s0, 0x10010
 ; SI-NEXT: buffer_store_short v0, off, s[4:7], 0
+; SI-NEXT: v_mov_b32_e32 v1, s0
 ; SI-NEXT: s_mov_b32 s4, 2
-; SI-NEXT: s_waitcnt expcnt(0)
-; SI-NEXT: v_mov_b32_e32 v0, s0
-; SI-NEXT: buffer_store_byte v0, off, s[4:7], 0
+; SI-NEXT: buffer_store_byte v1, off, s[4:7], 0
 ; SI-NEXT: s_endpgm
 ;
 ; VI-LABEL: widen_i17_constant_load:
 ; VI: ; %bb.0:
 ; VI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
 ; VI-NEXT: v_mov_b32_e32 v0, 0
 ; VI-NEXT: v_mov_b32_e32 v2, 2
 ; VI-NEXT: v_mov_b32_e32 v1, 0
	▲ Show 20 Lines • Show All 342 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[Scheduling] Create the missing dependency edges for store cluster
ClosedPublic
