Page MenuHomePhabricator

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
ClosedPublic

Authored by hsmhsm on May 27 2021, 9:02 AM.

Details

Summary

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Diff Detail

Unit TestsFailed

TimeTest
260 msx64 debian > LLVM.CodeGen/AMDGPU::32-bit-local-address-space.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -march=amdgcn -mcpu=bonaire -verify-machineinstrs < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=SI,FUNC,GFX7 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/32-bit-local-address-space.ll
260 msx64 debian > LLVM.CodeGen/AMDGPU::amdpal-callable.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=amdgcn--amdpal -verify-machineinstrs < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=GCN,SDAG -enable-var-scope /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
250 msx64 debian > LLVM.CodeGen/AMDGPU::divergence-at-use.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -march=amdgcn -mcpu=gfx900 -amdgpu-atomic-optimizations=true < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/divergence-at-use.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/divergence-at-use.ll
270 msx64 debian > LLVM.CodeGen/AMDGPU::global-saddr-atomics.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/global-saddr-atomics.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=GCN,GFX9 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/global-saddr-atomics.ll
270 msx64 debian > LLVM.CodeGen/AMDGPU::global-saddr-load.ll
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 < /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/global-saddr-load.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck -check-prefixes=GCN,GFX9 /var/lib/buildkite-agent/builds/llvm-project/llvm/test/CodeGen/AMDGPU/global-saddr-load.ll
View Full Test Results (40 Failed)

Event Timeline

hsmhsm created this revision.May 27 2021, 9:02 AM
hsmhsm requested review of this revision.May 27 2021, 9:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2021, 9:02 AM

Thanks! I'd suggest to combine all fix-lds-alignmen*.ll tests into one. Just use different names for different blocks of variables so they are used only in one kernel.

llvm/test/CodeGen/AMDGPU/lds-alignment.ll
129–130

Calculation comments need to be updated in the test.

hsmhsm updated this revision to Diff 348336.May 27 2021, 11:16 AM

Fixed review comments by Stas.

hsmhsm marked an inline comment as done.May 27 2021, 11:17 AM

What's the use case for this? It will increase memory use if it increases the alignment of any variables

rampitec added a comment.EditedMay 27 2021, 12:24 PM

What's the use case for this? It will increase memory use if it increases the alignment of any variables

It may increase fragmentation if we have a lot of small underaligned arrays, but in general superalignment should give a better performance. We are trading memory for performance. Changes exposed by the ISA tests are mostly positive. We may want to add a threshold for the FoundLocalVars.size() to inhibit the superalignment as the fragmentation is a function of the number of variables. Let's say in a worst case we will waste 15 bytes. To stay below 1Kb the threshold would be 68 which is plenty of variables. In reality fragmentation will be even less as we are not going to waste maximum. The other option is to compute total allocation and only superalign if ST.getOccupancyWithLocalMemSize() does not drop (and we do not exceed getLocalMemorySize() of course). The latter is more expensive but shall work better than a simple threshold. AMDGPUSubtarget::getMaxLocalMemSizeWithWaveCount() can be used to simplify the logic (one can check AMDGPUPromoteAlloca.cpp for the usage).

I'd say I am in favor of this change.

Do we actually see underaligned LDS variables in practice? What if the user is forcing a lower alignment to prioritize space utilization?

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
180–181

Separating these functions is just making this harder to follow. How about just one function to fully inspect the GV and increase alignment

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
153 ↗(On Diff #348336)

"Fix" isn't very descriptive. How about increaseAlignment?

155 ↗(On Diff #348336)

This is ignoring the explicit alignment

Do we actually see underaligned LDS variables in practice? What if the user is forcing a lower alignment to prioritize space utilization?

It shall be mostly cases like "local char a[2]; local short b[3];" etc.

hsmhsm updated this revision to Diff 348442.May 27 2021, 9:24 PM

Fixed review comments by Matt.

hsmhsm retitled this revision from [AMDGPU] Fix natural alignment of LDS globals during LDS lowering. to [AMDGPU] Increase natural alignment of LDS globals, if required, during LDS lowering..May 27 2021, 9:26 PM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm marked 3 inline comments as done.
hsmhsm retitled this revision from [AMDGPU] Increase natural alignment of LDS globals, if required, during LDS lowering. to [AMDGPU] If required, increase natural alignment of LDS globals before LDS lowering..May 27 2021, 9:30 PM
hsmhsm updated this revision to Diff 348445.May 27 2021, 10:14 PM

Chaged lit test file name from fix-lds-alignment.ll to update-lds-alignment.ll.

hsmhsm updated this revision to Diff 348446.May 27 2021, 10:38 PM

Remove fixme comment "; FIXME: Improve alignment" within lit test file lds-alignment.ll since it is fixed now.

foad added inline comments.May 28 2021, 1:38 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
159–161 ↗(On Diff #348446)

I think this needs a better explanation. Normally increasing the alignment would still be considered to "honor" the original alignment.

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
45–48 ↗(On Diff #348446)

I don't think it's a good idea to have three copies of this block comment (maybe just keep this one?) and I don't particularly like the use of "under-aligned" and "natural alignment" here. I would say something like: "Increase the alignment of LDS globals if necessary to maximise the chance that we can use aligned LDS instructions to access them"

hsmhsm updated this revision to Diff 348466.May 28 2021, 2:02 AM

Fix review comments by Jay.

hsmhsm marked 2 inline comments as done.May 28 2021, 2:05 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
159–161 ↗(On Diff #348446)

I was thinking about "#pragma clang section" defined for globals. But, now I think, it is not of a much meaning to LDS. And, I guess, it is programmatically and semantically fine to increase the alignment of LDS if it is necessary. Hence, remove the check itself.

hsmhsm retitled this revision from [AMDGPU] If required, increase natural alignment of LDS globals before LDS lowering. to [AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering..May 28 2021, 2:12 AM
hsmhsm edited the summary of this revision. (Show Details)
foad added inline comments.May 28 2021, 2:45 AM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗(On Diff #348466)

I think the first argument can be simpler: ArrayRef<GlobalVariable *> LDSGlobals.

hsmhsm updated this revision to Diff 348482.May 28 2021, 3:07 AM
hsmhsm marked an inline comment as done.

Fix review comment by Jay - use ArrayRef.

hsmhsm marked an inline comment as done.May 28 2021, 3:08 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗(On Diff #348466)

Oh! yes, it much nicer.

hsmhsm marked an inline comment as done.May 28 2021, 3:09 AM
JonChesterfield added a comment.EditedMay 28 2021, 12:34 PM

Not really the right place for this comment, but better odds of it being seen than in other bug trackers. Mahesha reports:

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8
@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

^ that doesn't get transformed, suggesting a bug in shouldLowerLDSToStruct. The pattern of return !F for conservative should probably be repeated where it currently has return false. I thought we had a test case for storing address of LDS in a global, but presumably not.

Not really the right place for this comment, but better odds of it being seen than in other bug trackers. Mahesha reports:

@lds = addrspace(3) global float undef, align 8
@gptr = addrspace(1) global i64* addrspacecast (float addrspace(3)* @lds to i64*), align 8
@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (float addrspace(3)* @lds to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64* addrspace(1)* @gptr to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
define void @f0() {
  %ld = load i64*, i64* addrspace(1)* @gptr
  ret void
}

^ that doesn't get transformed, suggesting a bug in shouldLowerLDSToStruct. The pattern of return !F for conservative should probably be repeated where it currently has return false. I thought we had a test case for storing address of LDS in a global, but presumably not.

In general, shouldLowerLDSToStruct() requires some clean-up and improvement in order to handle all scenarios including handling of constants for kernel LDS lowering, which we have not yet handled. Let me take all these missing things in a separate patch.

hsmhsm updated this revision to Diff 348803.May 31 2021, 8:07 AM

Rebased.

I'm not very convinced by this given the recent enthusiasm for decreasing LDS usage. Types are naturally aligned by default, so I think the only time they are aligned by less than that is when the programmer asked for it (or when the vectorizer is involved, but that's disabled on amdgpu afaik).

If someone has a char data[16] align(4) in LDS, i'm not at all sure it's obvious that they want the alignment increased to 16, despite that using more LDS than they asked for.

We could do something more conservative, where we put variables in order based on their sizes and alignments, and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct. That would be pure performance win, zero storage overhead cost. It leaves the choice to burn memory in favour of faster instructions in the hands of the developer writing the code.

hsmhsm added a comment.Jun 2 2021, 4:20 AM

I'm not very convinced by this given the recent enthusiasm for decreasing LDS usage. Types are naturally aligned by default, so I think the only time they are aligned by less than that is when the programmer asked for it (or when the vectorizer is involved, but that's disabled on amdgpu afaik).

If someone has a char data[16] align(4) in LDS, i'm not at all sure it's obvious that they want the alignment increased to 16, despite that using more LDS than they asked for.

We could do something more conservative, where we put variables in order based on their sizes and alignments, and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct. That would be pure performance win, zero storage overhead cost. It leaves the choice to burn memory in favour of faster instructions in the hands of the developer writing the code.

The main intention behind this patch is to change

char data[16] __align__(4)

to

char data[16] __align__(16)

since data should be aligned at 16 bytes boundary. Theoritically speaking, it may not correct to change the value mentined by porgrammer. But, from the practical point of view, since increasing the aligment value is programatically safe and since it fixes the performance issues due to unaligned access, I guess, it is fine to change it.

The usual approach to reduce the memory overhead due to padding within struct type is - to first sort the members based on their size, before padding [ Data structure alignment ] . But, I guess, we are sorting here based on alignment first, and then based on size. May be we need to revisit it.

I do not understand what you mean by - "and then go through the resulting struct and tag variables with the additional alignment they happen to have as a result of position in the struct"

rampitec accepted this revision.Jun 3 2021, 1:29 PM

I like the ISA changes, I'd say let's try it. If needed we could restrict it later with total LDS consumption calculation.
@hsmhsm please let a day before the submit in case if people have strong objections.

This revision is now accepted and ready to land.Jun 3 2021, 1:29 PM
arsenm added inline comments.Jun 3 2021, 2:43 PM
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
156 ↗(On Diff #348803)

I think this is a bad helper and it would be clearer to explicitly query the GV alignment

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h
47 ↗(On Diff #348803)

This should handle one GV at a time. I don't see a benefit to pushing the loop into the function

hsmhsm updated this revision to Diff 349760.Jun 3 2021, 9:02 PM

Fixed review comments by Matt.

This revision was landed with ongoing or failed builds.Jun 3 2021, 9:07 PM
This revision was automatically updated to reflect the committed changes.

If we're going with this anyway, let's be explicit in the title and commit message that we are over-aligning data. We aren't 'fixing' things, or doing them 'properly', we are intentionally burning memory in the hope of improving runtime.

I expect overrule-the-developer-intent transforms to generate bug reports. We could head those off by including a command line option to disable the overalignment.

hsmhsm reopened this revision.Jun 4 2021, 12:52 AM
This revision is now accepted and ready to land.Jun 4 2021, 12:52 AM
hsmhsm updated this revision to Diff 349782.Jun 4 2021, 1:00 AM

The earlier commit had to be reverted since Align Alignment(GV->getAlignment()); was causing assert when GV->getAlignment() returns 0.

hsmhsm added a comment.Jun 4 2021, 1:30 AM

If we're going with this anyway, let's be explicit in the title and commit message that we are over-aligning data. We aren't 'fixing' things, or doing them 'properly', we are intentionally burning memory in the hope of improving runtime.

I expect overrule-the-developer-intent transforms to generate bug reports. We could head those off by including a command line option to disable the overalignment.

I agree to the fact that we might in some cases going against the intent of the programmer (but it is safe), but do not agree to the fact that we are buring memory here. W.r.t un-optimal usage of padding memory, it is not because of this patch, it is actually because of sorting logic. We are sorting using alignment as a primary key, but, we should actually sort using size as primary key.

If you think that this patch is burning memory, then just think about how much memory, this original ModuleLDSLowering pass itself is burning.

hsmhsm updated this revision to Diff 350108.Jun 6 2021, 8:35 AM

Rebased.

Hi @JonChesterfield

We believe, this patch is harmless and does not cause any potential problems. As I already mentioned in my previous comment, the un-optimal padding memory is largely due to sorting logic (we should use size as primary key instead of alignment). Morever, we think, this patch is very important in our attempt to solve perf issues due to unliagned access of LDS. So, I am going to submit this patch since it is accepted.

I note that you did not change the commit message to reflect the contents of the patch as requested.

hsmhsm added a comment.Jun 7 2021, 6:02 AM

I note that you did not change the commit message to reflect the contents of the patch as requested.

The commit message says, what the the patch is doing - it increases the alignment of LDS if it is under-aligned in order to reduce the probability of unlinged access. So, I do not think, any change to commit message is required.

I note that you did not change the commit message to reflect the contents of the patch as requested.

The commit message says, what the the patch is doing - it increases the alignment of LDS if it is under-aligned in order to reduce the probability of unlinged access. So, I do not think, any change to commit message is required.

The commit message, as landed, is:

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Increasing the alignment is not necessary. Variables with less than natural alignment are not 'improper'. The commit message is inaccurate, as identified above.

This is as best an optimisation. It's also inconsistent with the other work on LDS that tries to minimise usage, in that this patch deliberately wastes some to aid instruction selection. As it is one that explicitly ignores programmer annotations we should not be surprised to see it return to us in a bug report.

I hoped to avoid further antagonising the (hopefully hypothetical) developer who finds this at bottom of a git bisect by not claiming it is necessary, and preferably by providing a command line hook to disable it.

hsmhsm added a comment.Jun 7 2021, 6:23 AM

I note that you did not change the commit message to reflect the contents of the patch as requested.

The commit message says, what the the patch is doing - it increases the alignment of LDS if it is under-aligned in order to reduce the probability of unlinged access. So, I do not think, any change to commit message is required.

The commit message, as landed, is:

[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.

Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.

Increasing the alignment is not necessary. Variables with less than natural alignment are not 'improper'. The commit message is inaccurate, as identified above.

This is as best an optimisation. It's also inconsistent with the other work on LDS that tries to minimise usage, in that this patch deliberately wastes some to aid instruction selection. As it is one that explicitly ignores programmer annotations we should not be surprised to see it return to us in a bug report.

I hoped to avoid further antagonising the (hopefully hypothetical) developer who finds this at bottom of a git bisect by not claiming it is necessary, and preferably by providing a command line hook to disable it.

I probably would think that the only phrase which is bit mis-leading is "if necessary" part in the message. Instead, it would have been like this - "Increase alignment of LDS globals if it is not on natural align boundary, before LDS lowering". But, I pesonally, it is okay to live with it.
W.r.t command line switch, if you expect one, we can still add it in a new patch.

Yes, I see that you agree with the wording you committed. We can't fix it now.

this patch is harmless and does not cause any potential problems

This patch increases LDS usage. Increasing LDS usage can take us over an occupancy threshold, where the last threshold converts a program that runs to one that doesn't. Using more LDS here also prevents it from being used by promote alloca, where it may have made a bigger performance improvement.

The code change itself is probably OK, my objection is to characterising a transform that ignores user annotation and can improve, degrade, or break programs as 'necessary' or 'proper', and that you stuck with that description even after attention was drawn to it.

hsmhsm added a comment.Jun 7 2021, 6:53 AM

Yes, I see that you agree with the wording you committed. We can't fix it now.

this patch is harmless and does not cause any potential problems

This patch increases LDS usage. Increasing LDS usage can take us over an occupancy threshold, where the last threshold converts a program that runs to one that doesn't. Using more LDS here also prevents it from being used by promote alloca, where it may have made a bigger performance improvement.

The code change itself is probably OK, my objection is to characterising a transform that ignores user annotation and can improve, degrade, or break programs as 'necessary' or 'proper', and that you stuck with that description even after attention was drawn to it.

As you suggested, and as I agreed to it, let's add a command line switch, make it "on" by default, and let user turn it off, if he/she does not want this transformation.

foad added inline comments.Jun 7 2021, 6:58 AM
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
178

This should really be using getAlign. getAlignment has a FIXME comment saying it will be removed.

You should use the Align or MaybeAlign types throughout instead of unsigned.

181

Why continue here? Surely it's better to fall through into the code that increases alignment?

hsmhsm marked 2 inline comments as done.Jun 7 2021, 7:19 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
178

I was using getAlign only, but l had to change it based on one of review comments, I think. Anyway, I will fix it in a new patch that I am planning to submit soon, to introduce a command line flag for this transformation.

181

Will fix it in new patch.