This is an archive of the discontinued LLVM Phabricator instance.

Differential D103261

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
ClosedPublic

Authored by hsmhsm on May 27 2021, 9:02 AM.

Details

Reviewers

JonChesterfield
arsenm
b-sumner
t-tye
rampitec
ronlieb
foad

Commits

rG52ffbfdffc24: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
rGd71ff907ef23: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Summary

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision. May 27 2021, 9:02 AM

Herald added subscribers: kerbowa, hiraditya, tpr and 5 others. May 27 2021, 9:02 AM

hsmhsm requested review of this revision. May 27 2021, 9:02 AM

Herald added a project: Restricted Project. May 27 2021, 9:02 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B106531: Diff 348292. May 27 2021, 9:38 AM

Thanks! I'd suggest to combine all fix-lds-alignmen*.ll tests into one. Just use different names for different blocks of variables so they are used only in one kernel.

llvm/test/CodeGen/AMDGPU/lds-alignment.ll
129–130	Calculation comments need to be updated in the test.

Fixed review comments by Stas.

hsmhsm marked an inline comment as done. May 27 2021, 11:17 AM

Harbormaster completed remote builds in B106561: Diff 348336. May 27 2021, 11:35 AM

What's the use case for this? It will increase memory use if it increases the alignment of any variables

In D103261#2785495, @JonChesterfield wrote:

What's the use case for this? It will increase memory use if it increases the alignment of any variables

It may increase fragmentation if we have a lot of small underaligned arrays, but in general superalignment should give better performance. We are trading memory for performance. Changes exposed by the ISA tests are mostly positive. We may want to add a threshold on FoundLocalVars.size() to inhibit the superalignment, as the fragmentation is a function of the number of variables. Let's say in the worst case we waste 15 bytes per variable. To stay below 1 KB the threshold would be 68 (68 × 15 = 1020 bytes), which is plenty of variables. In reality fragmentation will be even less, as we are not going to waste the maximum every time. The other option is to compute the total allocation and only superalign if ST.getOccupancyWithLocalMemSize() does not drop (and we do not exceed getLocalMemorySize(), of course). The latter is more expensive but should work better than a simple threshold. AMDGPUSubtarget::getMaxLocalMemSizeWithWaveCount() can be used to simplify the logic (one can check AMDGPUPromoteAlloca.cpp for the usage).

I'd say I am in favor of this change.
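The threshold arithmetic in the comment above can be checked with a small standalone sketch. The function names are hypothetical and the model is deliberately crude: it assumes every variable pays the worst-case padding for the chosen alignment, which the comment itself notes is pessimistic.

```cpp
#include <cstdint>

// Worst-case padding in front of one variable whose alignment is raised to
// `align` bytes: at most align - 1 bytes are skipped.
// Precondition (assumed): align is a power of two greater than 1.
constexpr std::uint64_t worstCasePadding(std::uint64_t align) {
  return align - 1;
}

// Largest number of variables whose combined worst-case padding stays
// strictly below `budgetBytes`.
constexpr std::uint64_t maxVarsBelowBudget(std::uint64_t align,
                                           std::uint64_t budgetBytes) {
  return (budgetBytes - 1) / worstCasePadding(align);
}
```

For a 16-byte alignment and a 1024-byte budget, `maxVarsBelowBudget(16, 1024)` evaluates to 68, matching the figure quoted in the comment.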

Do we actually see underaligned LDS variables in practice? What if the user is forcing a lower alignment to prioritize space utilization?

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
180–181	Separating these functions is just making this harder to follow. How about just one function to fully inspect the GV and increase alignment
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
153 ↗	(On Diff #348336)	"Fix" isn't very descriptive. How about increaseAlignment?
155 ↗	(On Diff #348336)	This is ignoring the explicit alignment

In D103261#2785541, @arsenm wrote:

Do we actually see underaligned LDS variables in practice? What if the user is forcing a lower alignment to prioritize space utilization?

It shall be mostly cases like "local char a[2]; local short b[3];" etc.
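To see what the change does to declarations like these, the size thresholds from this revision's diff of AMDGPULowerModuleLDSPass.cpp can be restated as a standalone function. This is only a sketch: `bumpedAlign` is a hypothetical name, plain integers stand in for LLVM's Align type, and the element sizes (1 byte for char, 2 bytes for short) are assumptions about the target.

```cpp
#include <algorithm>
#include <cstdint>

// Size-based alignment bump modelled on the thresholds in the patch under
// review. It never lowers the alignment the variable already has.
constexpr std::uint64_t bumpedAlign(std::uint64_t sizeBytes,
                                    std::uint64_t currentAlign) {
  std::uint64_t wanted = 1;
  if (sizeBytes > 8)
    wanted = 16;  // room for a b96 or b128 load/store
  else if (sizeBytes > 4)
    wanted = 8;   // room for a b64 load/store
  else if (sizeBytes > 2)
    wanted = 4;   // room for a b32 load/store
  else if (sizeBytes > 1)
    wanted = 2;   // room for a b16 load/store
  return std::max(currentAlign, wanted);
}
```

Under those assumptions, `char a[2]` (2 bytes, alignment 1) ends up with alignment 2, and `short b[3]` (6 bytes, alignment 2) ends up with alignment 8.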

Fixed review comments by Matt.

hsmhsm retitled this revision from "[AMDGPU] Fix natural alignment of LDS globals during LDS lowering." to "[AMDGPU] Increase natural alignment of LDS globals, if required, during LDS lowering." May 27 2021, 9:26 PM

hsmhsm edited the summary of this revision.

hsmhsm marked 3 inline comments as done.

hsmhsm retitled this revision from "[AMDGPU] Increase natural alignment of LDS globals, if required, during LDS lowering." to "[AMDGPU] If required, increase natural alignment of LDS globals before LDS lowering." May 27 2021, 9:30 PM

Harbormaster completed remote builds in B106646: Diff 348442. May 27 2021, 10:13 PM

Changed lit test file name from fix-lds-alignment.ll to update-lds-alignment.ll.

Remove fixme comment "; FIXME: Improve alignment" within lit test file lds-alignment.ll since it is fixed now.

Harbormaster completed remote builds in B106651: Diff 348446. May 27 2021, 11:12 PM

foad added inline comments. May 28 2021, 1:38 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
159–161 ↗	(On Diff #348446)	I think this needs a better explanation. Normally increasing the alignment would still be considered to "honor" the original alignment.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
45–48 ↗	(On Diff #348446)	I don't think it's a good idea to have three copies of this block comment (maybe just keep this one?) and I don't particularly like the use of "under-aligned" and "natural alignment" here. I would say something like: "Increase the alignment of LDS globals if necessary to maximise the chance that we can use aligned LDS instructions to access them"

Fix review comments by Jay.

hsmhsm marked 2 inline comments as done. May 28 2021, 2:05 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
159–161 ↗	(On Diff #348446)	I was thinking about "#pragma clang section" defined for globals. But now I think it does not mean much for LDS. And I guess it is programmatically and semantically fine to increase the alignment of LDS if it is necessary. Hence, I removed the check itself.

hsmhsm retitled this revision from "[AMDGPU] If required, increase natural alignment of LDS globals before LDS lowering." to "[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering." May 28 2021, 2:12 AM

hsmhsm edited the summary of this revision.

Harbormaster completed remote builds in B106664: Diff 348466. May 28 2021, 2:36 AM

foad added inline comments. May 28 2021, 2:45 AM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗	(On Diff #348466)	I think the first argument can be simpler: `ArrayRef<GlobalVariable *> LDSGlobals`.

Fix review comment by Jay - use ArrayRef.

hsmhsm marked an inline comment as done. May 28 2021, 3:08 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗	(On Diff #348466)	Oh yes, it is much nicer.

hsmhsm marked an inline comment as done. May 28 2021, 3:09 AM

Harbormaster completed remote builds in B106677: Diff 348482. May 28 2021, 3:55 AM

Not really the right place for this comment, but better odds of it being seen than in other bug trackers. Mahesha reports:

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8
@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

^ that doesn't get transformed, suggesting a bug in shouldLowerLDSToStruct. The pattern of return !F for conservative should probably be repeated where it currently has return false. I thought we had a test case for storing address of LDS in a global, but presumably not.

In D103261#2787572, @JonChesterfield wrote:
Not really the right place for this comment, but better odds of it being seen than in other bug trackers. Mahesha reports:
@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8
@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}
^ that doesn't get transformed, suggesting a bug in shouldLowerLDSToStruct. The pattern of return !F for conservative should probably be repeated where it currently has return false. I thought we had a test case for storing address of LDS in a global, but presumably not.

In general, shouldLowerLDSToStruct() requires some clean-up and improvement in order to handle all scenarios including handling of constants for kernel LDS lowering, which we have not yet handled. Let me take all these missing things in a separate patch.

Rebased.

Harbormaster completed remote builds in B106917: Diff 348803. May 31 2021, 8:39 AM

I'm not very convinced by this given the recent enthusiasm for decreasing LDS usage. Types are naturally aligned by default, so I think the only time they are aligned by less than that is when the programmer asked for it (or when the vectorizer is involved, but that's disabled on amdgpu afaik).

If someone has a char data[16] align(4) in LDS, i'm not at all sure it's obvious that they want the alignment increased to 16, despite that using more LDS than they asked for.

We could do something more conservative, where we put variables in order based on their sizes and alignments, and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct. That would be pure performance win, zero storage overhead cost. It leaves the choice to burn memory in favour of faster instructions in the hands of the developer writing the code.
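The conservative alternative described above (lay the variables out with their requested alignments, then record whatever extra alignment each one gets from its position) could look roughly like the sketch below. All names are hypothetical, and the alignment of the struct's base address is taken as an input because the thread does not say what it would be.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Var {
  std::uint64_t size;   // allocation size in bytes
  std::uint64_t align;  // alignment requested by the source, a power of two
};

// Round `offset` up to the next multiple of `align` (a power of two).
constexpr std::uint64_t alignTo(std::uint64_t offset, std::uint64_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Largest power of two that divides `offset`, capped at `cap`.
constexpr std::uint64_t impliedAlign(std::uint64_t offset, std::uint64_t cap) {
  std::uint64_t a = 1;
  while (a < cap && offset % (a * 2) == 0)
    a *= 2;
  return a;
}

// Lay out `vars` in the given order without changing any requested alignment,
// and return the alignment that the variable at `index` gets from its offset.
// `structAlign` is the assumed alignment of the struct's base address.
template <std::size_t N>
constexpr std::uint64_t alignmentByPosition(const std::array<Var, N> &vars,
                                            std::size_t index,
                                            std::uint64_t structAlign) {
  std::uint64_t offset = 0;
  for (std::size_t i = 0; i < N; ++i) {
    offset = alignTo(offset, vars[i].align);
    if (i == index)
      return impliedAlign(offset, structAlign);
    offset += vars[i].size;
  }
  return 0;  // index out of range
}
```

With a 16-byte array requested at alignment 4 placed first in a struct whose base is assumed 16-byte aligned, the array sits at offset 0 and can be tagged with alignment 16 without adding any padding.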

In D103261#2789956, @JonChesterfield wrote:

I'm not very convinced by this given the recent enthusiasm for decreasing LDS usage. Types are naturally aligned by default, so I think the only time they are aligned by less than that is when the programmer asked for it (or when the vectorizer is involved, but that's disabled on amdgpu afaik).

If someone has a char data[16] align(4) in LDS, i'm not at all sure it's obvious that they want the alignment increased to 16, despite that using more LDS than they asked for.

We could do something more conservative, where we put variables in order based on their sizes and alignments, and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct. That would be pure performance win, zero storage overhead cost. It leaves the choice to burn memory in favour of faster instructions in the hands of the developer writing the code.

The main intention behind this patch is to change

char data[16] __align__(4)

to

char data[16] __align__(16)

since data should be aligned on a 16-byte boundary. Theoretically speaking, it may not be correct to change the value specified by the programmer. But from a practical point of view, since increasing the alignment is programmatically safe and since it fixes the performance issues due to unaligned access, I guess it is fine to change it.

The usual approach to reduce the memory overhead due to padding within a struct type is to first sort the members by size before padding [ Data structure alignment ]. But I guess we are sorting here by alignment first, and then by size. Maybe we need to revisit it.

I do not understand what you mean by - "and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct"
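The ordering question raised in this comment can be explored by computing the laid-out size for a given member order. The sketch below is not taken from the pass: the names are hypothetical, the member sizes and alignments are made up, and tail padding after the last member is ignored.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Member {
  std::uint64_t size;   // bytes
  std::uint64_t align;  // power of two
};

// Round `offset` up to the next multiple of `align` (a power of two).
constexpr std::uint64_t alignTo(std::uint64_t offset, std::uint64_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Bytes occupied when members are placed in the given order, including the
// padding inserted in front of each member to satisfy its alignment.
template <std::size_t N>
constexpr std::uint64_t layoutSize(const std::array<Member, N> &order) {
  std::uint64_t offset = 0;
  for (std::size_t i = 0; i < N; ++i)
    offset = alignTo(offset, order[i].align) + order[i].size;
  return offset;
}
```

For the two members {size 8, align 8} and {size 20, align 4}, alignment-first order occupies 28 bytes and size-first order occupies 32. For {3, 4}, {4, 2}, {1, 1} the result flips: alignment-first occupies 9 bytes and size-first occupies 8. Neither key wins for every input in this simple model.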

I like the ISA changes, I'd say let's try it. If needed we could restrict it later with total LDS consumption calculation.
@hsmhsm please wait a day before submitting in case people have strong objections.

This revision is now accepted and ready to land. Jun 3 2021, 1:29 PM

arsenm added inline comments. Jun 3 2021, 2:43 PM

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
156 ↗	(On Diff #348803)	I think this is a bad helper and it would be clearer to explicitly query the GV alignment
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗	(On Diff #348803)	This should handle one GV at a time. I don't see a benefit to pushing the loop into the function

Fixed review comments by Matt.

This revision was landed with ongoing or failed builds. Jun 3 2021, 9:07 PM

Closed by commit rGd71ff907ef23: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering. (authored by hsmhsm).

This revision was automatically updated to reflect the committed changes.

hsmhsm added a commit: rGd71ff907ef23: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Harbormaster completed remote builds in B107603: Diff 349760. Jun 3 2021, 9:43 PM

hsmhsm added a reverting change: rG753437fc1db3: Revert "[AMDGPU] Increase alignment of LDS globals if necessary before LDS…. Jun 3 2021, 10:48 PM

hsmhsm mentioned this in D103671: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering. Jun 4 2021, 12:31 AM

If we're going with this anyway, let's be explicit in the title and commit message that we are over-aligning data. We aren't 'fixing' things, or doing them 'properly', we are intentionally burning memory in the hope of improving runtime.

I expect overrule-the-developer-intent transforms to generate bug reports. We could head those off by including a command line option to disable the overalignment.

hsmhsm reopened this revision. Jun 4 2021, 12:52 AM

This revision is now accepted and ready to land. Jun 4 2021, 12:52 AM

The earlier commit had to be reverted since Align Alignment(GV->getAlignment()); was causing an assertion failure when GV->getAlignment() returns 0.
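The failure mode described here is the usual one for a "0 means unspecified" alignment value being fed to a type that only accepts non-zero powers of two. A minimal standalone model of the guard is sketched below. It does not use the LLVM API, the names are hypothetical, and the choice of fallback (for example the type's ABI alignment) is an assumption, since the thread does not say which fallback the re-landed patch used.

```cpp
#include <cstdint>

// Mirrors the precondition a strict alignment type would assert on:
// the value must be a non-zero power of two.
constexpr bool isValidAlign(std::uint64_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Pick the alignment to start from: the explicit one when present (non-zero),
// otherwise a caller-supplied fallback.
constexpr std::uint64_t startingAlign(std::uint64_t explicitAlign,
                                      std::uint64_t fallbackAlign) {
  return explicitAlign != 0 ? explicitAlign : fallbackAlign;
}
```

Routing the raw value through something like `startingAlign` before constructing the strict type keeps the zero case from ever reaching the assertion.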

In D103261#2798291, @JonChesterfield wrote:

If we're going with this anyway, let's be explicit in the title and commit message that we are over-aligning data. We aren't 'fixing' things, or doing them 'properly', we are intentionally burning memory in the hope of improving runtime.

I expect overrule-the-developer-intent transforms to generate bug reports. We could head those off by including a command line option to disable the overalignment.

I agree that in some cases we might be going against the intent of the programmer (but it is safe), but I do not agree that we are burning memory here. As for the suboptimal use of padding memory, it is not because of this patch; it is actually because of the sorting logic. We are sorting using alignment as the primary key, but we should actually sort using size as the primary key.

If you think that this patch is burning memory, then just think about how much memory the original ModuleLDSLowering pass itself is burning.

Harbormaster completed remote builds in B107619: Diff 349782. Jun 4 2021, 1:31 AM

Rebased.

Harbormaster completed remote builds in B107854: Diff 350108. Jun 6 2021, 9:51 AM

Hi @JonChesterfield

We believe this patch is harmless and does not cause any potential problems. As I already mentioned in my previous comment, the suboptimal padding is largely due to the sorting logic (we should use size as the primary key instead of alignment). Moreover, we think this patch is very important in our attempt to solve perf issues due to unaligned access of LDS. So I am going to submit this patch since it is accepted.

Closed by commit rG52ffbfdffc24: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering. (authored by hsmhsm). Jun 7 2021, 5:32 AM

This revision was automatically updated to reflect the committed changes.

hsmhsm added a commit: rG52ffbfdffc24: [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

I note that you did not change the commit message to reflect the contents of the patch as requested.

In D103261#2802464, @JonChesterfield wrote:

I note that you did not change the commit message to reflect the contents of the patch as requested.

The commit message says what the patch is doing: it increases the alignment of LDS if it is under-aligned, in order to reduce the probability of unaligned access. So I do not think any change to the commit message is required.

In D103261#2802471, @hsmhsm wrote:

The commit message says what the patch is doing: it increases the alignment of LDS if it is under-aligned, in order to reduce the probability of unaligned access. So I do not think any change to the commit message is required.

The commit message, as landed, is:

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Increasing the alignment is not necessary. Variables with less than natural alignment are not 'improper'. The commit message is inaccurate, as identified above.

This is at best an optimisation. It's also inconsistent with the other work on LDS that tries to minimise usage, in that this patch deliberately wastes some to aid instruction selection. As it is one that explicitly ignores programmer annotations, we should not be surprised to see it return to us in a bug report.

I hoped to avoid further antagonising the (hopefully hypothetical) developer who finds this at bottom of a git bisect by not claiming it is necessary, and preferably by providing a command line hook to disable it.

In D103261#2802513, @JonChesterfield wrote:

The commit message, as landed, is:

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Increasing the alignment is not necessary. Variables with less than natural alignment are not 'improper'. The commit message is inaccurate, as identified above.

This is at best an optimisation. It's also inconsistent with the other work on LDS that tries to minimise usage, in that this patch deliberately wastes some to aid instruction selection. As it is one that explicitly ignores programmer annotations, we should not be surprised to see it return to us in a bug report.

I hoped to avoid further antagonising the (hopefully hypothetical) developer who finds this at bottom of a git bisect by not claiming it is necessary, and preferably by providing a command line hook to disable it.

I would think the only phrase that is a bit misleading is the "if necessary" part of the message. Instead, it could have read "Increase alignment of LDS globals if it is not on natural align boundary, before LDS lowering". But personally, I am okay to live with it.
W.r.t. the command line switch, if you expect one, we can still add it in a new patch.

Yes, I see that you agree with the wording you committed. We can't fix it now.

this patch is harmless and does not cause any potential problems

This patch increases LDS usage. Increasing LDS usage can take us over an occupancy threshold, where the last threshold converts a program that runs to one that doesn't. Using more LDS here also prevents it from being used by promote alloca, where it may have made a bigger performance improvement.

The code change itself is probably OK, my objection is to characterising a transform that ignores user annotation and can improve, degrade, or break programs as 'necessary' or 'proper', and that you stuck with that description even after attention was drawn to it.
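The threshold effect described above can be illustrated with a deliberately simplified model. This is not the AMDGPU backend's occupancy calculation: it only divides a fixed LDS budget by the per-workgroup usage, the function name is hypothetical, and the 65536-byte budget used in the note below is an illustrative assumption.

```cpp
#include <cstdint>
#include <limits>

// Simplified model (an assumption, not the backend's real formula): how many
// workgroups fit into a fixed LDS budget at the same time.
constexpr std::uint64_t workgroupsThatFit(std::uint64_t ldsBudgetBytes,
                                          std::uint64_t ldsPerWorkgroupBytes) {
  if (ldsPerWorkgroupBytes == 0)
    return std::numeric_limits<std::uint64_t>::max();  // LDS is not the limit
  // A result of 0 means a single workgroup no longer fits at all.
  return ldsBudgetBytes / ldsPerWorkgroupBytes;
}
```

With an assumed budget of 65536 bytes, a kernel using 32768 bytes fits twice, while 15 extra bytes of padding drop it to one. At the last threshold, a kernel using exactly 65536 bytes fits once, and the same 15 bytes of padding leave it with zero.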

In D103261#2802563, @JonChesterfield wrote:

Yes, I see that you agree with the wording you committed. We can't fix it now.

this patch is harmless and does not cause any potential problems

This patch increases LDS usage. Increasing LDS usage can take us over an occupancy threshold, where the last threshold converts a program that runs to one that doesn't. Using more LDS here also prevents it from being used by promote alloca, where it may have made a bigger performance improvement.

The code change itself is probably OK, my objection is to characterising a transform that ignores user annotation and can improve, degrade, or break programs as 'necessary' or 'proper', and that you stuck with that description even after attention was drawn to it.

As you suggested, and as I agreed, let's add a command line switch, make it "on" by default, and let the user turn it off if they do not want this transformation.

foad added inline comments. Jun 7 2021, 6:58 AM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
178	This should really be using getAlign. getAlignment has a FIXME comment saying it will be removed. You should use the Align or MaybeAlign types throughout instead of unsigned.
181	Why continue here? Surely it's better to fall through into the code that increases alignment?

hsmhsm marked 2 inline comments as done. Jun 7 2021, 7:19 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
178	I was using getAlign originally, but I had to change it based on one of the review comments, I think. Anyway, I will fix it in a new patch that I am planning to submit soon, to introduce a command line flag for this transformation.
181	Will fix it in new patch.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp	23 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-value.ll	4 lines
llvm/test/CodeGen/AMDGPU/ds_read2.ll	25 lines
llvm/test/CodeGen/AMDGPU/ds_read2_offset_order.ll	6 lines
llvm/test/CodeGen/AMDGPU/ds_write2.ll	57 lines
llvm/test/CodeGen/AMDGPU/lds-alignment.ll	42 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-globals.ll	2 lines
llvm/test/CodeGen/AMDGPU/update-lds-alignment.ll	193 lines

Diff 349761

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

(166 lines not shown)	bool processUsedLDS(Module &M, Function *F = nullptr) {
std::vector<GlobalVariable *> FoundLocalVars =		std::vector<GlobalVariable *> FoundLocalVars =
AMDGPU::findVariablesToLower(M, UsedList, F);		AMDGPU::findVariablesToLower(M, UsedList, F);

if (FoundLocalVars.empty()) {		if (FoundLocalVars.empty()) {
// No variables to rewrite, no changes made.		// No variables to rewrite, no changes made.
return false;		return false;
}		}

		// Increase the alignment of LDS globals if necessary to maximise the chance
		// that we can use aligned LDS instructions to access them.
		for (auto *GV : FoundLocalVars) {
		Align Alignment(GV->getAlignment());
		TypeSize GVSize = DL.getTypeAllocSize(GV->getValueType());

		if (GVSize > 8) {
		// We might want to use a b96 or b128 load/store
		Alignment = std::max(Alignment, Align(16));
		} else if (GVSize > 4) {
		// We might want to use a b64 load/store
		Alignment = std::max(Alignment, Align(8));
		} else if (GVSize > 2) {
		// We might want to use a b32 load/store
		Alignment = std::max(Alignment, Align(4));
		} else if (GVSize > 1) {
		// We might want to use a b16 load/store
		Alignment = std::max(Alignment, Align(2));
		}

		GV->setAlignment(Alignment);
		}

// Sort by alignment, descending, to minimise padding.		// Sort by alignment, descending, to minimise padding.
// On ties, sort by size, descending, then by name, lexicographical.		// On ties, sort by size, descending, then by name, lexicographical.
llvm::stable_sort(		llvm::stable_sort(
FoundLocalVars,		FoundLocalVars,
[&](const GlobalVariable *LHS, const GlobalVariable *RHS) -> bool {		[&](const GlobalVariable *LHS, const GlobalVariable *RHS) -> bool {
Align ALHS = AMDGPU::getAlign(DL, LHS);		Align ALHS = AMDGPU::getAlign(DL, LHS);
Align ARHS = AMDGPU::getAlign(DL, RHS);		Align ARHS = AMDGPU::getAlign(DL, RHS);
if (ALHS != ARHS) {		if (ALHS != ARHS) {
(130 lines not shown)

llvm/test/CodeGen/AMDGPU/GlobalISel/lds-global-value.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=bonaire -verify-machineinstrs < %s | FileCheck %s			; RUN: llc -global-isel -mtriple=amdgcn-amd-amdhsa -mcpu=bonaire -verify-machineinstrs < %s | FileCheck %s
	; TODO: Replace with existing DAG tests			; TODO: Replace with existing DAG tests

	@lds_512_4 = internal unnamed_addr addrspace(3) global [128 x i32] undef, align 4			@lds_512_4 = internal unnamed_addr addrspace(3) global [128 x i32] undef, align 4
	@lds_4_8 = addrspace(3) global i32 undef, align 8			@lds_4_8 = addrspace(3) global i32 undef, align 8

	define amdgpu_kernel void @use_lds_globals(i32 addrspace(1)* %out, i32 addrspace(3)* %in) #0 {			define amdgpu_kernel void @use_lds_globals(i32 addrspace(1)* %out, i32 addrspace(3)* %in) #0 {
	; CHECK-LABEL: use_lds_globals:			; CHECK-LABEL: use_lds_globals:
	; CHECK: ; %bb.0: ; %entry			; CHECK: ; %bb.0: ; %entry
	; CHECK-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x0			; CHECK-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x0
	; CHECK-NEXT: v_mov_b32_e32 v0, 8			; CHECK-NEXT: v_mov_b32_e32 v0, 4
	; CHECK-NEXT: s_mov_b32 m0, -1			; CHECK-NEXT: s_mov_b32 m0, -1
	; CHECK-NEXT: ds_read_b32 v3, v0			; CHECK-NEXT: ds_read_b32 v3, v0
	; CHECK-NEXT: v_mov_b32_e32 v2, 9			; CHECK-NEXT: v_mov_b32_e32 v2, 9
	; CHECK-NEXT: s_waitcnt lgkmcnt(0)			; CHECK-NEXT: s_waitcnt lgkmcnt(0)
	; CHECK-NEXT: s_add_u32 s0, s0, 4			; CHECK-NEXT: s_add_u32 s0, s0, 4
	; CHECK-NEXT: s_addc_u32 s1, s1, 0			; CHECK-NEXT: s_addc_u32 s1, s1, 0
	; CHECK-NEXT: v_mov_b32_e32 v0, s0			; CHECK-NEXT: v_mov_b32_e32 v0, s0
	; CHECK-NEXT: v_mov_b32_e32 v1, s1			; CHECK-NEXT: v_mov_b32_e32 v1, s1
	; CHECK-NEXT: flat_store_dword v[0:1], v3			; CHECK-NEXT: flat_store_dword v[0:1], v3
	; CHECK-NEXT: v_mov_b32_e32 v0, 0			; CHECK-NEXT: v_mov_b32_e32 v0, 0x200
	; CHECK-NEXT: ds_write_b32 v0, v2			; CHECK-NEXT: ds_write_b32 v0, v2
	; CHECK-NEXT: s_endpgm			; CHECK-NEXT: s_endpgm
	entry:			entry:
	%tmp0 = getelementptr [128 x i32], [128 x i32] addrspace(3)* @lds_512_4, i32 0, i32 1			%tmp0 = getelementptr [128 x i32], [128 x i32] addrspace(3)* @lds_512_4, i32 0, i32 1
	%tmp1 = load i32, i32 addrspace(3)* %tmp0			%tmp1 = load i32, i32 addrspace(3)* %tmp0
	%tmp2 = getelementptr i32, i32 addrspace(1)* %out, i32 1			%tmp2 = getelementptr i32, i32 addrspace(1)* %out, i32 1
	store i32 %tmp1, i32 addrspace(1)* %tmp2			store i32 %tmp1, i32 addrspace(1)* %tmp2
	store i32 9, i32 addrspace(3)* @lds_4_8			store i32 9, i32 addrspace(3)* @lds_4_8
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/test/CodeGen/AMDGPU/ds_read2.ll

(1,001 lines not shown)	; GFX9-NEXT: s_endpgm
ret void		ret void
}		}

@bar = addrspace(3) global [4 x i64] undef, align 4		@bar = addrspace(3) global [4 x i64] undef, align 4

define amdgpu_kernel void @load_misaligned64_constant_offsets(i64 addrspace(1)* %out) {		define amdgpu_kernel void @load_misaligned64_constant_offsets(i64 addrspace(1)* %out) {
; CI-LABEL: load_misaligned64_constant_offsets:		; CI-LABEL: load_misaligned64_constant_offsets:
; CI: ; %bb.0:		; CI: ; %bb.0:
; CI-NEXT: v_mov_b32_e32 v2, 0		; CI-NEXT: v_mov_b32_e32 v0, 0
; CI-NEXT: s_mov_b32 m0, -1		; CI-NEXT: s_mov_b32 m0, -1
; CI-NEXT: ds_read2_b32 v[0:1], v2 offset1:1		; CI-NEXT: ds_read2_b64 v[0:3], v0 offset1:1
; CI-NEXT: ds_read2_b32 v[2:3], v2 offset0:2 offset1:3
; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9		; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9
; CI-NEXT: s_mov_b32 s3, 0xf000		; CI-NEXT: s_mov_b32 s3, 0xf000
; CI-NEXT: s_mov_b32 s2, -1		; CI-NEXT: s_mov_b32 s2, -1
; CI-NEXT: s_waitcnt lgkmcnt(0)		; CI-NEXT: s_waitcnt lgkmcnt(0)
; CI-NEXT: v_add_i32_e32 v0, vcc, v0, v2		; CI-NEXT: v_add_i32_e32 v0, vcc, v0, v2
; CI-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc		; CI-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc
; CI-NEXT: buffer_store_dwordx2 v[0:1], off, s[0:3], 0		; CI-NEXT: buffer_store_dwordx2 v[0:1], off, s[0:3], 0
; CI-NEXT: s_endpgm		; CI-NEXT: s_endpgm
;		;
; GFX9-ALIGNED-LABEL: load_misaligned64_constant_offsets:		; GFX9-ALIGNED-LABEL: load_misaligned64_constant_offsets:
; GFX9-ALIGNED: ; %bb.0:		; GFX9-ALIGNED: ; %bb.0:
; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v4, 0		; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v4, 0
; GFX9-ALIGNED-NEXT: ds_read2_b32 v[0:1], v4 offset1:1		; GFX9-ALIGNED-NEXT: ds_read2_b64 v[0:3], v4 offset1:1
; GFX9-ALIGNED-NEXT: ds_read2_b32 v[2:3], v4 offset0:2 offset1:3
; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24		; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-ALIGNED-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2		; GFX9-ALIGNED-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
; GFX9-ALIGNED-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc		; GFX9-ALIGNED-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
; GFX9-ALIGNED-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]		; GFX9-ALIGNED-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]
; GFX9-ALIGNED-NEXT: s_endpgm		; GFX9-ALIGNED-NEXT: s_endpgm
;		;
; GFX9-UNALIGNED-LABEL: load_misaligned64_constant_offsets:		; GFX9-UNALIGNED-LABEL: load_misaligned64_constant_offsets:
; GFX9-UNALIGNED: ; %bb.0:		; GFX9-UNALIGNED: ; %bb.0:
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v4, 0		; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v4, 0
; GFX9-UNALIGNED-NEXT: ds_read2_b64 v[0:3], v4 offset1:1		; GFX9-UNALIGNED-NEXT: ds_read_b128 v[0:3], v4
; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24		; GFX9-UNALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)		; GFX9-UNALIGNED-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-UNALIGNED-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2		; GFX9-UNALIGNED-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
; GFX9-UNALIGNED-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc		; GFX9-UNALIGNED-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
; GFX9-UNALIGNED-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]		; GFX9-UNALIGNED-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]
; GFX9-UNALIGNED-NEXT: s_endpgm		; GFX9-UNALIGNED-NEXT: s_endpgm
%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4		%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4
%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4		%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4
%sum = add i64 %val0, %val1		%sum = add i64 %val0, %val1
store i64 %sum, i64 addrspace(1)* %out, align 8		store i64 %sum, i64 addrspace(1)* %out, align 8
ret void		ret void
}		}

@bar.large = addrspace(3) global [4096 x i64] undef, align 4		@bar.large = addrspace(3) global [4096 x i64] undef, align 4

define amdgpu_kernel void @load_misaligned64_constant_large_offsets(i64 addrspace(1)* %out) {		define amdgpu_kernel void @load_misaligned64_constant_large_offsets(i64 addrspace(1)* %out) {
; CI-LABEL: load_misaligned64_constant_large_offsets:		; CI-LABEL: load_misaligned64_constant_large_offsets:
; CI: ; %bb.0:		; CI: ; %bb.0:
; CI-NEXT: v_mov_b32_e32 v0, 0x4000		; CI-NEXT: v_mov_b32_e32 v2, 0
; CI-NEXT: v_mov_b32_e32 v2, 0x7ff8
; CI-NEXT: s_mov_b32 m0, -1		; CI-NEXT: s_mov_b32 m0, -1
; CI-NEXT: ds_read2_b32 v[0:1], v0 offset1:1		; CI-NEXT: ds_read_b64 v[0:1], v2 offset:16384
; CI-NEXT: ds_read2_b32 v[2:3], v2 offset1:1		; CI-NEXT: ds_read_b64 v[2:3], v2 offset:32760
; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9		; CI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9
; CI-NEXT: s_mov_b32 s3, 0xf000		; CI-NEXT: s_mov_b32 s3, 0xf000
; CI-NEXT: s_mov_b32 s2, -1		; CI-NEXT: s_mov_b32 s2, -1
; CI-NEXT: s_waitcnt lgkmcnt(0)		; CI-NEXT: s_waitcnt lgkmcnt(0)
; CI-NEXT: v_add_i32_e32 v0, vcc, v0, v2		; CI-NEXT: v_add_i32_e32 v0, vcc, v0, v2
; CI-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc		; CI-NEXT: v_addc_u32_e32 v1, vcc, v1, v3, vcc
; CI-NEXT: buffer_store_dwordx2 v[0:1], off, s[0:3], 0		; CI-NEXT: buffer_store_dwordx2 v[0:1], off, s[0:3], 0
; CI-NEXT: s_endpgm		; CI-NEXT: s_endpgm
;		;
; GFX9-LABEL: load_misaligned64_constant_large_offsets:		; GFX9-LABEL: load_misaligned64_constant_large_offsets:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: v_mov_b32_e32 v0, 0x4000
; GFX9-NEXT: v_mov_b32_e32 v2, 0x7ff8
; GFX9-NEXT: ds_read2_b32 v[0:1], v0 offset1:1
; GFX9-NEXT: ds_read2_b32 v[2:3], v2 offset1:1
; GFX9-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
; GFX9-NEXT: v_mov_b32_e32 v4, 0		; GFX9-NEXT: v_mov_b32_e32 v4, 0
		; GFX9-NEXT: ds_read_b64 v[0:1], v4 offset:16384
		; GFX9-NEXT: ds_read_b64 v[2:3], v4 offset:32760
		; GFX9-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x24
; GFX9-NEXT: s_waitcnt lgkmcnt(0)		; GFX9-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2		; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v0, v2
; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc		; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v3, vcc
; GFX9-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]		; GFX9-NEXT: global_store_dwordx2 v4, v[0:1], s[0:1]
; GFX9-NEXT: s_endpgm		; GFX9-NEXT: s_endpgm
%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4		%val0 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4
%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4		%val1 = load i64, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4
%sum = add i64 %val0, %val1		%sum = add i64 %val0, %val1

llvm/test/CodeGen/AMDGPU/ds_read2_offset_order.ll

	; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s \| FileCheck -strict-whitespace -check-prefix=SI %s			; RUN: llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < %s \| FileCheck -strict-whitespace -check-prefix=SI %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -strict-whitespace -check-prefix=SI %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s \| FileCheck -strict-whitespace -check-prefix=SI %s

	@lds = addrspace(3) global [512 x float] undef, align 4			@lds = addrspace(3) global [512 x float] undef, align 4

	; offset0 is larger than offset1			; offset0 is larger than offset1

	; SI-LABEL: {{^}}offset_order:			; SI-LABEL: {{^}}offset_order:
	; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset1:14{{$}}			; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}}
	; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:2 offset1:3			; SI-DAG: ds_read_b64 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset:8
	; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:1024
	; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:11 offset1:12			; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:11 offset1:12
				; SI-DAG: ds_read2_b32 v[{{[0-9]+}}:{{[0-9]+}}], v{{[0-9]+}} offset0:6 offset1:248
	define amdgpu_kernel void @offset_order(float addrspace(1)* %out) {			define amdgpu_kernel void @offset_order(float addrspace(1)* %out) {
	entry:			entry:
	%ptr0 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 0			%ptr0 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 0
	%val0 = load float, float addrspace(3)* %ptr0			%val0 = load float, float addrspace(3)* %ptr0

	%ptr1 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 256			%ptr1 = getelementptr inbounds [512 x float], [512 x float] addrspace(3)* @lds, i32 0, i32 256
	%val1 = load float, float addrspace(3)* %ptr1			%val1 = load float, float addrspace(3)* %ptr1
	%add1 = fadd float %val0, %val1			%add1 = fadd float %val0, %val1

llvm/test/CodeGen/AMDGPU/ds_write2.ll

; GFX9-NEXT: s_endpgm
ret void		ret void
}		}

@bar = addrspace(3) global [4 x i64] undef, align 4		@bar = addrspace(3) global [4 x i64] undef, align 4

define amdgpu_kernel void @store_misaligned64_constant_offsets() {		define amdgpu_kernel void @store_misaligned64_constant_offsets() {
; CI-LABEL: store_misaligned64_constant_offsets:		; CI-LABEL: store_misaligned64_constant_offsets:
; CI: ; %bb.0:		; CI: ; %bb.0:
; CI-NEXT: v_mov_b32_e32 v0, 0		; CI-NEXT: s_movk_i32 s0, 0x7b
; CI-NEXT: v_mov_b32_e32 v1, 0x7b		; CI-NEXT: s_mov_b32 s1, 0
		; CI-NEXT: v_mov_b32_e32 v0, s0
		; CI-NEXT: v_mov_b32_e32 v2, 0
		; CI-NEXT: v_mov_b32_e32 v1, s1
; CI-NEXT: s_mov_b32 m0, -1		; CI-NEXT: s_mov_b32 m0, -1
; CI-NEXT: ds_write2_b32 v0, v1, v0 offset1:1		; CI-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:1
; CI-NEXT: ds_write2_b32 v0, v1, v0 offset0:2 offset1:3
; CI-NEXT: s_endpgm		; CI-NEXT: s_endpgm
;		;
; GFX9-ALIGNED-LABEL: store_misaligned64_constant_offsets:		; GFX9-ALIGNED-LABEL: store_misaligned64_constant_offsets:
; GFX9-ALIGNED: ; %bb.0:		; GFX9-ALIGNED: ; %bb.0:
; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v0, 0		; GFX9-ALIGNED-NEXT: s_movk_i32 s0, 0x7b
; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX9-ALIGNED-NEXT: s_mov_b32 s1, 0
; GFX9-ALIGNED-NEXT: ds_write2_b32 v0, v1, v0 offset1:1		; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v0, s0
; GFX9-ALIGNED-NEXT: ds_write2_b32 v0, v1, v0 offset0:2 offset1:3		; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v2, 0
		; GFX9-ALIGNED-NEXT: v_mov_b32_e32 v1, s1
		; GFX9-ALIGNED-NEXT: ds_write2_b64 v2, v[0:1], v[0:1] offset1:1
; GFX9-ALIGNED-NEXT: s_endpgm		; GFX9-ALIGNED-NEXT: s_endpgm
;		;
; GFX9-UNALIGNED-LABEL: store_misaligned64_constant_offsets:		; GFX9-UNALIGNED-LABEL: store_misaligned64_constant_offsets:
; GFX9-UNALIGNED: ; %bb.0:		; GFX9-UNALIGNED: ; %bb.0:
; GFX9-UNALIGNED-NEXT: s_movk_i32 s0, 0x7b		; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v0, 0x7b
; GFX9-UNALIGNED-NEXT: s_mov_b32 s1, 0		; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v1, 0
; GFX9-UNALIGNED-NEXT: s_mov_b32 s2, s0		; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v2, v0
; GFX9-UNALIGNED-NEXT: s_mov_b32 s3, s1		; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v3, v1
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v0, s0		; GFX9-UNALIGNED-NEXT: ds_write_b128 v1, v[0:3]
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v2, s2
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v4, 0
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v1, s1
; GFX9-UNALIGNED-NEXT: v_mov_b32_e32 v3, s3
; GFX9-UNALIGNED-NEXT: ds_write2_b64 v4, v[0:1], v[2:3] offset1:1
; GFX9-UNALIGNED-NEXT: s_endpgm		; GFX9-UNALIGNED-NEXT: s_endpgm
store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4		store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 0), align 4
store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4		store i64 123, i64 addrspace(3)* getelementptr inbounds ([4 x i64], [4 x i64] addrspace(3)* @bar, i32 0, i32 1), align 4
ret void		ret void
}		}

@bar.large = addrspace(3) global [4096 x i64] undef, align 4		@bar.large = addrspace(3) global [4096 x i64] undef, align 4

define amdgpu_kernel void @store_misaligned64_constant_large_offsets() {		define amdgpu_kernel void @store_misaligned64_constant_large_offsets() {
; CI-LABEL: store_misaligned64_constant_large_offsets:		; CI-LABEL: store_misaligned64_constant_large_offsets:
; CI: ; %bb.0:		; CI: ; %bb.0:
; CI-NEXT: v_mov_b32_e32 v0, 0x4000		; CI-NEXT: s_movk_i32 s0, 0x7b
; CI-NEXT: v_mov_b32_e32 v1, 0x7b		; CI-NEXT: s_mov_b32 s1, 0
		; CI-NEXT: v_mov_b32_e32 v0, s0
; CI-NEXT: v_mov_b32_e32 v2, 0		; CI-NEXT: v_mov_b32_e32 v2, 0
		; CI-NEXT: v_mov_b32_e32 v1, s1
; CI-NEXT: s_mov_b32 m0, -1		; CI-NEXT: s_mov_b32 m0, -1
; CI-NEXT: ds_write2_b32 v0, v1, v2 offset1:1		; CI-NEXT: ds_write_b64 v2, v[0:1] offset:16384
; CI-NEXT: v_mov_b32_e32 v0, 0x7ff8		; CI-NEXT: ds_write_b64 v2, v[0:1] offset:32760
; CI-NEXT: ds_write2_b32 v0, v1, v2 offset1:1
; CI-NEXT: s_endpgm		; CI-NEXT: s_endpgm
;		;
; GFX9-LABEL: store_misaligned64_constant_large_offsets:		; GFX9-LABEL: store_misaligned64_constant_large_offsets:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: v_mov_b32_e32 v0, 0x4000		; GFX9-NEXT: s_movk_i32 s0, 0x7b
; GFX9-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX9-NEXT: s_mov_b32 s1, 0
		; GFX9-NEXT: v_mov_b32_e32 v0, s0
; GFX9-NEXT: v_mov_b32_e32 v2, 0		; GFX9-NEXT: v_mov_b32_e32 v2, 0
; GFX9-NEXT: ds_write2_b32 v0, v1, v2 offset1:1		; GFX9-NEXT: v_mov_b32_e32 v1, s1
; GFX9-NEXT: v_mov_b32_e32 v0, 0x7ff8		; GFX9-NEXT: ds_write_b64 v2, v[0:1] offset:16384
; GFX9-NEXT: ds_write2_b32 v0, v1, v2 offset1:1		; GFX9-NEXT: ds_write_b64 v2, v[0:1] offset:32760
; GFX9-NEXT: s_endpgm		; GFX9-NEXT: s_endpgm
store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4		store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 2048), align 4
store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4		store i64 123, i64 addrspace(3)* getelementptr inbounds ([4096 x i64], [4096 x i64] addrspace(3)* @bar.large, i32 0, i32 4095), align 4
ret void		ret void
}		}

@sgemm.lA = internal unnamed_addr addrspace(3) global [264 x float] undef, align 4		@sgemm.lA = internal unnamed_addr addrspace(3) global [264 x float] undef, align 4
@sgemm.lB = internal unnamed_addr addrspace(3) global [776 x float] undef, align 4		@sgemm.lB = internal unnamed_addr addrspace(3) global [776 x float] undef, align 4

llvm/test/CodeGen/AMDGPU/lds-alignment.ll

define amdgpu_kernel void @test_round_size_2(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {

%lds.align16.1.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.1 to i8 addrspace(3)*		%lds.align16.1.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.1 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.align16.1.bc, i8 addrspace(1)* align 4 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.align16.1.bc, i8 addrspace(1)* align 4 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.align16.1.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.align16.1.bc, i32 38, i1 false)

ret void		ret void
}		}

; 38 + (2 pad) + 38		; 38 + (10 pad) + 38 (= 86)
; HSA-LABEL: {{^}}test_round_size_2_align_8:		; HSA-LABEL: {{^}}test_round_size_2_align_8:
; HSA: workgroup_group_segment_byte_size = 78		; HSA: workgroup_group_segment_byte_size = 86
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_2_align_8(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_2_align_8(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
; HSA: workgroup_group_segment_byte_size = 0		; HSA: workgroup_group_segment_byte_size = 0
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_high_align_lds_arg(i8 addrspace(1)* %out, i8 addrspace(1)* %in, i8 addrspace(3)* align 64 %lds.arg) #1 {		define amdgpu_kernel void @test_high_align_lds_arg(i8 addrspace(1)* %out, i8 addrspace(1)* %in, i8 addrspace(3)* align 64 %lds.arg) #1 {
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 64 %lds.arg, i8 addrspace(1)* align 64 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 64 %lds.arg, i8 addrspace(1)* align 64 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 64 %out, i8 addrspace(3)* align 64 %lds.arg, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 64 %out, i8 addrspace(3)* align 64 %lds.arg, i32 38, i1 false)
ret void		ret void
}		}

; FIXME: missign alignment can be improved.
; (39 * 4) + (4 pad) + (7 * 8) = 216		; (39 * 4) + (4 pad) + (7 * 8) = 216
; HSA-LABEL: {{^}}test_missing_alignment_size_2_order0:		; HSA-LABEL: {{^}}test_missing_alignment_size_2_order0:
; HSA: workgroup_group_segment_byte_size = 216		; HSA: workgroup_group_segment_byte_size = 216
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_missing_alignment_size_2_order0(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_missing_alignment_size_2_order0(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.missing.align.0.bc = bitcast [39 x i32] addrspace(3)* @lds.missing.align.0 to i8 addrspace(3)*		%lds.missing.align.0.bc = bitcast [39 x i32] addrspace(3)* @lds.missing.align.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i8 addrspace(1)* align 4 %in, i32 160, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i8 addrspace(1)* align 4 %in, i32 160, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i32 160, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i32 160, i1 false)
define amdgpu_kernel void @test_missing_alignment_size_2_order1(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {

%lds.missing.align.0.bc = bitcast [39 x i32] addrspace(3)* @lds.missing.align.0 to i8 addrspace(3)*		%lds.missing.align.0.bc = bitcast [39 x i32] addrspace(3)* @lds.missing.align.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i8 addrspace(1)* align 4 %in, i32 160, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i8 addrspace(1)* align 4 %in, i32 160, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i32 160, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 4 %out, i8 addrspace(3)* align 4 %lds.missing.align.0.bc, i32 160, i1 false)

ret void		ret void
}		}

; align 32, 16, 8		; align 32, 16, 16
; 38 + (10 pad) + 38 + (2 pad) + 38 = 126		; 38 + (10 pad) + 38 + (10 pad) + 38 ( = 134)
		rampitec: Calculation comments need to be updated in the test.
; HSA-LABEL: {{^}}test_round_size_3_order0:		; HSA-LABEL: {{^}}test_round_size_3_order0:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order0(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order0(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*		%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)

%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

ret void		ret void
}		}

; align 32, 8, 16		; align 32, 16, 16
; 38 (+ 10 pad) + 38 + (2 pad) + 38 = 126		; 38 (+ 10 pad) + 38 + (10 pad) + 38 ( = 134)
; HSA-LABEL: {{^}}test_round_size_3_order1:		; HSA-LABEL: {{^}}test_round_size_3_order1:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order1(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order1(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*		%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)

%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

ret void		ret void
}		}

; align 16, 32, 8		; align 32, 16, 16
; 38 + (10 pad) + 38 + (2 pad) + 38 = 126		; 38 + (10 pad) + 38 + (10 pad) + 38 ( = 134)
; HSA-LABEL: {{^}}test_round_size_3_order2:		; HSA-LABEL: {{^}}test_round_size_3_order2:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order2(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order2(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*		%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)

%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

ret void		ret void
}		}

; FIXME: Improve alignment		; align 32, 16, 16
; align 16, 8, 32		; 38 + (10 pad) + 38 + (10 pad) + 38 ( = 134)
; 38 + (10 pad) + 38 + (2 pad) + 38
; HSA-LABEL: {{^}}test_round_size_3_order3:		; HSA-LABEL: {{^}}test_round_size_3_order3:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order3(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order3(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*		%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)

ret void		ret void
}		}

; align 8, 32, 16		; align 32, 16, 16
; 38 + (10 pad) + 38 + (2 pad) + 38 = 126		; 38 + (10 pad) + 38 + (10 pad) + 38 (= 134)
; HSA-LABEL: {{^}}test_round_size_3_order4:		; HSA-LABEL: {{^}}test_round_size_3_order4:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order4(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order4(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*		%lds.align32.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align32.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align32.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align32.0.bc, i32 38, i1 false)

%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align16.0.bc, i32 38, i1 false)

ret void		ret void
}		}

; align 8, 16, 32		; align 32, 16, 16
; 38 + (10 pad) + 38 + (2 pad) + 38 = 126		; 38 + (10 pad) + 38 + (10 pad) + 38 (= 134)
; HSA-LABEL: {{^}}test_round_size_3_order5:		; HSA-LABEL: {{^}}test_round_size_3_order5:
; HSA: workgroup_group_segment_byte_size = 126		; HSA: workgroup_group_segment_byte_size = 134
; HSA: group_segment_alignment = 4		; HSA: group_segment_alignment = 4
define amdgpu_kernel void @test_round_size_3_order5(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {		define amdgpu_kernel void @test_round_size_3_order5(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #1 {
%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*		%lds.align8.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align8.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align8.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)		call void @llvm.memcpy.p1i8.p3i8.i32(i8 addrspace(1)* align 8 %out, i8 addrspace(3)* align 8 %lds.align8.0.bc, i32 38, i1 false)

%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*		%lds.align16.0.bc = bitcast [38 x i8] addrspace(3)* @lds.align16.0 to i8 addrspace(3)*
call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)		call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 8 %lds.align16.0.bc, i8 addrspace(1)* align 8 %in, i32 38, i1 false)
[12 lines not shown]
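
The updated size comment for test_round_size_3_order5 (126 becoming 134) can be checked with a short sketch. The size thresholds in `bumped_align` and the "highest alignment first" packing order are assumptions: the pass source is not shown in this part of the diff, and these rules were chosen because they reproduce the numbers in the test comments. The helper names are hypothetical.

```python
def bumped_align(size, align):
    """Assumed size-based rule: raise the alignment so wider LDS accesses stay aligned."""
    if size > 8:
        return max(align, 16)
    if size > 4:
        return max(align, 8)
    if size > 2:
        return max(align, 4)
    if size > 1:
        return max(align, 2)
    return align


def packed_size(globals_, bump):
    """Pack (size, align) pairs, highest alignment first, and return the end offset."""
    items = [(s, bumped_align(s, a) if bump else a) for s, a in globals_]
    items.sort(key=lambda sa: (-sa[1], -sa[0]))
    offset = 0
    for size, align in items:
        offset = (offset + align - 1) // align * align  # pad up to this member's alignment
        offset += size
    return offset


# The three 38-byte LDS arrays used by the kernel, with their declared alignments.
order5 = [(38, 8), (38, 16), (38, 32)]
before = packed_size(order5, bump=False)  # 38 + 10 pad + 38 + 2 pad + 38
after = packed_size(order5, bump=True)    # 38 + 10 pad + 38 + 10 pad + 38
print(before, after)  # prints: 126 134
```

Under these assumptions the align 8 array is raised to align 16 because it is larger than 8 bytes, which turns the 2-byte pad into a 10-byte pad and accounts for the 8-byte growth in workgroup_group_segment_byte_size.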

llvm/test/CodeGen/AMDGPU/promote-alloca-globals.ll

	; RUN: opt -data-layout=A5 -S -mtriple=amdgcn-unknown-unknown -amdgpu-promote-alloca < %s \| FileCheck -check-prefix=IR %s			; RUN: opt -data-layout=A5 -S -mtriple=amdgcn-unknown-unknown -amdgpu-promote-alloca < %s \| FileCheck -check-prefix=IR %s
	; RUN: llc -march=amdgcn -mcpu=tonga < %s \| FileCheck -check-prefix=ASM %s			; RUN: llc -march=amdgcn -mcpu=tonga < %s \| FileCheck -check-prefix=ASM %s


	@global_array0 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4			@global_array0 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4
	@global_array1 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4			@global_array1 = internal unnamed_addr addrspace(3) global [750 x [10 x i32]] undef, align 4

	; IR-LABEL: define amdgpu_kernel void @promote_alloca_size_256(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* nocapture %in) {			; IR-LABEL: define amdgpu_kernel void @promote_alloca_size_256(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* nocapture %in) {
	; IR: alloca [10 x i32]			; IR: alloca [10 x i32]
	; ASM-LABEL: {{^}}promote_alloca_size_256:			; ASM-LABEL: {{^}}promote_alloca_size_256:
	; ASM: .amdgpu_lds llvm.amdgcn.kernel.promote_alloca_size_256.lds, 60000, 4			; ASM: .amdgpu_lds llvm.amdgcn.kernel.promote_alloca_size_256.lds, 60000, 16
	; ASM-NOT: .amdgpu_lds			; ASM-NOT: .amdgpu_lds

	define amdgpu_kernel void @promote_alloca_size_256(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* nocapture %in) {			define amdgpu_kernel void @promote_alloca_size_256(i32 addrspace(1)* nocapture %out, i32 addrspace(1)* nocapture %in) {
	entry:			entry:
	%stack = alloca [10 x i32], align 4, addrspace(5)			%stack = alloca [10 x i32], align 4, addrspace(5)
	%tmp = load i32, i32 addrspace(1)* %in, align 4			%tmp = load i32, i32 addrspace(1)* %in, align 4
	%arrayidx1 = getelementptr inbounds [10 x i32], [10 x i32] addrspace(5)* %stack, i32 0, i32 %tmp			%arrayidx1 = getelementptr inbounds [10 x i32], [10 x i32] addrspace(5)* %stack, i32 0, i32 %tmp
	store i32 4, i32 addrspace(5)* %arrayidx1, align 4			store i32 4, i32 addrspace(5)* %arrayidx1, align 4
	[17 lines not shown]
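
The changed ASM check (alignment 4 becoming 16, size staying 60000) follows from simple arithmetic. The sketch below assumes that both global arrays end up in the kernel's LDS struct (the 60000 total suggests so) and that any global larger than 8 bytes is raised to at least 16-byte alignment; neither assumption is confirmed by the lines shown here.

```python
OUTER, INNER, I32_BYTES = 750, 10, 4
array_bytes = OUTER * INNER * I32_BYTES  # size of one [750 x [10 x i32]] array

declared_align = 4
# Assumed rule: globals larger than 8 bytes get at least 16-byte alignment.
new_align = max(declared_align, 16) if array_bytes > 8 else declared_align

# Place the second array at the next multiple of its alignment after the first.
second_offset = (array_bytes + new_align - 1) // new_align * new_align
total_bytes = second_offset + array_bytes

print(array_bytes, second_offset, total_bytes, new_align)  # prints: 30000 30000 60000 16
```

Because 30000 is already a multiple of 16, raising the alignment adds no padding, so only the alignment operand of the `.amdgpu_lds` directive changes.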

llvm/test/CodeGen/AMDGPU/update-lds-alignment.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-lower-module-lds < %s \| FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds < %s \| FileCheck %s

				; Properly aligned, same size as alignment.
				; CHECK: %llvm.amdgcn.kernel.k0.lds.t = type { [16 x i8], [8 x i8], [4 x i8], [2 x i8], [1 x i8] }

				; Different properly aligned values, but same size of 1.
				; CHECK: %llvm.amdgcn.kernel.k1.lds.t = type { [1 x i8], [7 x i8], [1 x i8], [3 x i8], [1 x i8], [1 x i8], [1 x i8], [1 x i8] }

				; All are under-aligned, requires to fix each on different alignment boundary.
				; CHECK: %llvm.amdgcn.kernel.k2.lds.t = type { [9 x i8], [7 x i8], [5 x i8], [3 x i8], [3 x i8], [1 x i8], [2 x i8] }

				; All LDS are underaligned, requires to allocate on 8 byte boundary
				; CHECK: %llvm.amdgcn.kernel.k3.lds.t = type { [7 x i8], [1 x i8], [7 x i8], [1 x i8], [6 x i8], [2 x i8], [5 x i8] }

				; All LDS are underaligned, requires to allocate on 16 byte boundary
				; CHECK: %llvm.amdgcn.kernel.k4.lds.t = type { [12 x i8], [4 x i8], [11 x i8], [5 x i8], [10 x i8], [6 x i8], [9 x i8] }

				; All LDS are properly aligned on 16 byte boundary, but they are of different size.
				; CHECK: %llvm.amdgcn.kernel.k5.lds.t = type { [20 x i8], [12 x i8], [19 x i8], [13 x i8], [18 x i8], [14 x i8], [17 x i8] }

				; CHECK: @llvm.amdgcn.kernel.k0.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k0.lds.t undef, align 16
				; CHECK: @llvm.amdgcn.kernel.k1.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k1.lds.t undef, align 16
				; CHECK: @llvm.amdgcn.kernel.k2.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k2.lds.t undef, align 16
				; CHECK: @llvm.amdgcn.kernel.k3.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k3.lds.t undef, align 8
				; CHECK: @llvm.amdgcn.kernel.k4.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k4.lds.t undef, align 16
				; CHECK: @llvm.amdgcn.kernel.k5.lds = internal addrspace(3) global %llvm.amdgcn.kernel.k5.lds.t undef, align 16


				; Properly aligned, same size as alignment.
				; CHECK-NOT: @k0.lds.size.1.align.1
				; CHECK-NOT: @k0.lds.size.2.align.2
				; CHECK-NOT: @k0.lds.size.4.align.4
				; CHECK-NOT: @k0.lds.size.8.align.8
				; CHECK-NOT: @k0.lds.size.16.align.16
				@k0.lds.size.1.align.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
				@k0.lds.size.2.align.2 = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 2
				@k0.lds.size.4.align.4 = internal unnamed_addr addrspace(3) global [4 x i8] undef, align 4
				@k0.lds.size.8.align.8 = internal unnamed_addr addrspace(3) global [8 x i8] undef, align 8
				@k0.lds.size.16.align.16 = internal unnamed_addr addrspace(3) global [16 x i8] undef, align 16

				define amdgpu_kernel void @k0() {
				%k0.lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* @k0.lds.size.1.align.1 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k0.lds.size.1.align.1.bc, align 1

				%k0.lds.size.2.align.2.bc = bitcast [2 x i8] addrspace(3)* @k0.lds.size.2.align.2 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k0.lds.size.2.align.2.bc, align 2

				%k0.lds.size.4.align.4.bc = bitcast [4 x i8] addrspace(3)* @k0.lds.size.4.align.4 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k0.lds.size.4.align.4.bc, align 4

				%k0.lds.size.8.align.8.bc = bitcast [8 x i8] addrspace(3)* @k0.lds.size.8.align.8 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k0.lds.size.8.align.8.bc, align 8

				%k0.lds.size.16.align.16.bc = bitcast [16 x i8] addrspace(3)* @k0.lds.size.16.align.16 to i8 addrspace(3)*
				store i8 5, i8 addrspace(3)* %k0.lds.size.16.align.16.bc, align 16

				ret void
				}

				; Different properly aligned values, but same size of 1.
				; CHECK-NOT: @k1.lds.size.1.align.1
				; CHECK-NOT: @k1.lds.size.1.align.2
				; CHECK-NOT: @k1.lds.size.1.align.4
				; CHECK-NOT: @k1.lds.size.1.align.8
				; CHECK-NOT: @k1.lds.size.1.align.16
				@k1.lds.size.1.align.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
				@k1.lds.size.1.align.2 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 2
				@k1.lds.size.1.align.4 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 4
				@k1.lds.size.1.align.8 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 8
				@k1.lds.size.1.align.16 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 16

				define amdgpu_kernel void @k1() {
				%k1.lds.size.1.align.1.bc = bitcast [1 x i8] addrspace(3)* @k1.lds.size.1.align.1 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k1.lds.size.1.align.1.bc, align 1

				%k1.lds.size.1.align.2.bc = bitcast [1 x i8] addrspace(3)* @k1.lds.size.1.align.2 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k1.lds.size.1.align.2.bc, align 2

				%k1.lds.size.1.align.4.bc = bitcast [1 x i8] addrspace(3)* @k1.lds.size.1.align.4 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k1.lds.size.1.align.4.bc, align 4

				%k1.lds.size.1.align.8.bc = bitcast [1 x i8] addrspace(3)* @k1.lds.size.1.align.8 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k1.lds.size.1.align.8.bc, align 8

				%k1.lds.size.1.align.16.bc = bitcast [1 x i8] addrspace(3)* @k1.lds.size.1.align.16 to i8 addrspace(3)*
				store i8 5, i8 addrspace(3)* %k1.lds.size.1.align.16.bc, align 16

				ret void
				}

				; All are under-aligned, requires to fix each on different alignment boundary.
				; CHECK-NOT: @k2.lds.size.2.align.1
				; CHECK-NOT: @k2.lds.size.3.align.2
				; CHECK-NOT: @k2.lds.size.5.align.4
				; CHECK-NOT: @k2.lds.size.9.align.8
				@k2.lds.size.2.align.1 = internal unnamed_addr addrspace(3) global [2 x i8] undef, align 1
				@k2.lds.size.3.align.2 = internal unnamed_addr addrspace(3) global [3 x i8] undef, align 2
				@k2.lds.size.5.align.4 = internal unnamed_addr addrspace(3) global [5 x i8] undef, align 4
				@k2.lds.size.9.align.8 = internal unnamed_addr addrspace(3) global [9 x i8] undef, align 8

				define amdgpu_kernel void @k2() {
				%k2.lds.size.2.align.1.bc = bitcast [2 x i8] addrspace(3)* @k2.lds.size.2.align.1 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k2.lds.size.2.align.1.bc, align 1

				%k2.lds.size.3.align.2.bc = bitcast [3 x i8] addrspace(3)* @k2.lds.size.3.align.2 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k2.lds.size.3.align.2.bc, align 2

				%k2.lds.size.5.align.4.bc = bitcast [5 x i8] addrspace(3)* @k2.lds.size.5.align.4 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k2.lds.size.5.align.4.bc, align 4

				%k2.lds.size.9.align.8.bc = bitcast [9 x i8] addrspace(3)* @k2.lds.size.9.align.8 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k2.lds.size.9.align.8.bc, align 8

				ret void
				}

				; All LDS are underaligned, requires to allocate on 8 byte boundary
				; CHECK-NOT: @k3.lds.size.5.align.2
				; CHECK-NOT: @k3.lds.size.6.align.2
				; CHECK-NOT: @k3.lds.size.7.align.2
				; CHECK-NOT: @k3.lds.size.7.align.4
				@k3.lds.size.5.align.2 = internal unnamed_addr addrspace(3) global [5 x i8] undef, align 2
				@k3.lds.size.6.align.2 = internal unnamed_addr addrspace(3) global [6 x i8] undef, align 2
				@k3.lds.size.7.align.2 = internal unnamed_addr addrspace(3) global [7 x i8] undef, align 2
				@k3.lds.size.7.align.4 = internal unnamed_addr addrspace(3) global [7 x i8] undef, align 4

				define amdgpu_kernel void @k3() {
				%k3.lds.size.5.align.2.bc = bitcast [5 x i8] addrspace(3)* @k3.lds.size.5.align.2 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k3.lds.size.5.align.2.bc, align 2

				%k3.lds.size.6.align.2.bc = bitcast [6 x i8] addrspace(3)* @k3.lds.size.6.align.2 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k3.lds.size.6.align.2.bc, align 2

				%k3.lds.size.7.align.2.bc = bitcast [7 x i8] addrspace(3)* @k3.lds.size.7.align.2 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k3.lds.size.7.align.2.bc, align 2

				%k3.lds.size.7.align.4.bc = bitcast [7 x i8] addrspace(3)* @k3.lds.size.7.align.4 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k3.lds.size.7.align.4.bc, align 4

				ret void
				}

				; All LDS are underaligned, requires to allocate on 16 byte boundary
				; CHECK-NOT: @k4.lds.size.9.align.1
				; CHECK-NOT: @k4.lds.size.10.align.2
				; CHECK-NOT: @k4.lds.size.11.align.4
				; CHECK-NOT: @k4.lds.size.12.align.8
				@k4.lds.size.9.align.1 = internal unnamed_addr addrspace(3) global [9 x i8] undef, align 1
				@k4.lds.size.10.align.2 = internal unnamed_addr addrspace(3) global [10 x i8] undef, align 2
				@k4.lds.size.11.align.4 = internal unnamed_addr addrspace(3) global [11 x i8] undef, align 4
				@k4.lds.size.12.align.8 = internal unnamed_addr addrspace(3) global [12 x i8] undef, align 8

				define amdgpu_kernel void @k4() {
				%k4.lds.size.9.align.1.bc = bitcast [9 x i8] addrspace(3)* @k4.lds.size.9.align.1 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k4.lds.size.9.align.1.bc, align 1

				%k4.lds.size.10.align.2.bc = bitcast [10 x i8] addrspace(3)* @k4.lds.size.10.align.2 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k4.lds.size.10.align.2.bc, align 2

				%k4.lds.size.11.align.4.bc = bitcast [11 x i8] addrspace(3)* @k4.lds.size.11.align.4 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k4.lds.size.11.align.4.bc, align 4

				%k4.lds.size.12.align.8.bc = bitcast [12 x i8] addrspace(3)* @k4.lds.size.12.align.8 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k4.lds.size.12.align.8.bc, align 8

				ret void
				}

				; CHECK-NOT: @k5.lds.size.17.align.16
				; CHECK-NOT: @k5.lds.size.18.align.16
				; CHECK-NOT: @k5.lds.size.19.align.16
				; CHECK-NOT: @k5.lds.size.20.align.16
				@k5.lds.size.17.align.16 = internal unnamed_addr addrspace(3) global [17 x i8] undef, align 16
				@k5.lds.size.18.align.16 = internal unnamed_addr addrspace(3) global [18 x i8] undef, align 16
				@k5.lds.size.19.align.16 = internal unnamed_addr addrspace(3) global [19 x i8] undef, align 16
				@k5.lds.size.20.align.16 = internal unnamed_addr addrspace(3) global [20 x i8] undef, align 16

				define amdgpu_kernel void @k5() {
				%k5.lds.size.17.align.16.bc = bitcast [17 x i8] addrspace(3)* @k5.lds.size.17.align.16 to i8 addrspace(3)*
				store i8 1, i8 addrspace(3)* %k5.lds.size.17.align.16.bc, align 16

				%k5.lds.size.18.align.16.bc = bitcast [18 x i8] addrspace(3)* @k5.lds.size.18.align.16 to i8 addrspace(3)*
				store i8 2, i8 addrspace(3)* %k5.lds.size.18.align.16.bc, align 16

				%k5.lds.size.19.align.16.bc = bitcast [19 x i8] addrspace(3)* @k5.lds.size.19.align.16 to i8 addrspace(3)*
				store i8 3, i8 addrspace(3)* %k5.lds.size.19.align.16.bc, align 16

				%k5.lds.size.20.align.16.bc = bitcast [20 x i8] addrspace(3)* @k5.lds.size.20.align.16 to i8 addrspace(3)*
				store i8 4, i8 addrspace(3)* %k5.lds.size.20.align.16.bc, align 16

				ret void
				}
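
The struct types expected by the CHECK lines at the top of update-lds-alignment.ll can be derived by hand, and the sketch below does that derivation in code. It is an illustration and not the pass itself: the size thresholds, the sort order (alignment descending, then size descending), and the helper names are all assumptions, chosen because they reproduce the expected types for k0 through k5.

```python
def bumped_align(size, align):
    """Assumed size-based rule: raise the alignment so wider LDS accesses stay aligned."""
    if size > 8:
        return max(align, 16)
    if size > 4:
        return max(align, 8)
    if size > 2:
        return max(align, 4)
    if size > 1:
        return max(align, 2)
    return align


def struct_fields(globals_):
    """Return ([N for each [N x i8] field, padding included], struct alignment)."""
    items = sorted(((s, bumped_align(s, a)) for s, a in globals_),
                   key=lambda sa: (-sa[1], -sa[0]))
    fields, offset = [], 0
    for size, align in items:
        pad = -offset % align  # bytes needed to reach the next multiple of align
        if pad:
            fields.append(pad)
        fields.append(size)
        offset += pad + size
    return fields, max(a for _, a in items)


# (size, declared alignment) for the LDS globals of each kernel in the test.
KERNELS = {
    "k0": [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)],
    "k1": [(1, 1), (1, 2), (1, 4), (1, 8), (1, 16)],
    "k2": [(2, 1), (3, 2), (5, 4), (9, 8)],
    "k3": [(5, 2), (6, 2), (7, 2), (7, 4)],
    "k4": [(9, 1), (10, 2), (11, 4), (12, 8)],
    "k5": [(17, 16), (18, 16), (19, 16), (20, 16)],
}

for name, globals_ in KERNELS.items():
    fields, align = struct_fields(globals_)
    body = ", ".join(f"[{n} x i8]" for n in fields)
    print(f"%llvm.amdgcn.kernel.{name}.lds.t = type {{ {body} }}  ; align {align}")
```

For k2, for example, the declared alignments 1, 2, 4, 8 become 2, 4, 8, 16 under the assumed rule, which yields the fields 9, 7 (pad), 5, 3 (pad), 3, 1 (pad), 2 and a struct alignment of 16. For k3 every global is between 5 and 7 bytes, so all are raised to 8 and the struct alignment is 8.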
