This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Do not second-guess alignment for alloca
ClosedPublic

Authored by asavonic on Oct 7 2022, 10:14 AM.

Details

Summary

Alignment of an alloca in IR can be lower than the preferred alignment
on purpose, but this override essentially treats the preferred
alignment as the minimum alignment.

The patch changes this behavior to always use the specified
alignment. If alignment is not set explicitly in LLVM IR, it is set to
DL.getPrefTypeAlign(Ty) in computeAllocaDefaultAlign.

Tests are changed as well: explicit alignment is increased to match
the preferred alignment if it changes output, or omitted when it is
hard to determine the right value (e.g. for pointers, some structs, or
weird types).

Diff Detail

Event Timeline

asavonic created this revision.Oct 7 2022, 10:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2022, 10:14 AM
asavonic requested review of this revision.Oct 7 2022, 10:14 AM

A similar issue was discussed and fixed in D79532, where promotion of alignment caused stack realignment.
This patch now removes promotion of alignment, so we always use the specified alignment.

Given we're de-emphasizing the type of allocas, this probably makes sense? We should be careful that this doesn't have unexpected effects, though; decreasing the alignment of an alloca could affect code generation. (For example, see D134282.)

llvm/lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp
130–131

Outdated comment?

llvm/test/CodeGen/AArch64/preferred-alignment.ll
10

Please explicitly specify alignment where possible.

I though the preferred alignment of i8 is 1? I can't see how this change has any practical effect.

asavonic updated this revision to Diff 466460.Oct 10 2022, 3:29 AM
  • Removed an outdated comment
  • Added another testcase to AArch64/preferred-alignment.ll

Given we're de-emphasizing the type of allocas, this probably makes sense? We should be careful that this doesn't have unexpected effects, though; decreasing the alignment of an alloca could affect code generation. (For example, see D134282.)

Yes, this can cause some issues (as indicated by test changes), because alignment on alloca suddenly becomes more important.
Unfortunately, I don't see any other way how we can support alignment that is lower than the preferred alignment.

llvm/test/CodeGen/AArch64/preferred-alignment.ll
10

Apparently all 3 allocas have preferred alignment of 4.

It sound like the test is supposed to verify the preferred alignment. If we keep explicit alignment, it will take precedence over the preferred alignment.

So I suggest to have two tests: one where alignment is implicit, and another where alignment is set to whatever value we expect the preferred alignment to be.

nikic added a subscriber: nikic.Oct 10 2022, 3:50 AM

This makes sense to me. InstCombine will still raise the alignment based on load/stores so long as it does not require stack realignment.

asavonic updated this revision to Diff 482160.Dec 12 2022, 9:18 AM
  • Rebased.

@efriedma, can you please check this patch again?

This revision is now accepted and ready to land.Dec 12 2022, 9:28 AM
This revision was landed with ongoing or failed builds.Dec 15 2022, 7:18 AM
This revision was automatically updated to reflect the committed changes.

hello , this commit is breaking our buildbot : could you address please ? https://lab.llvm.org/buildbot/#/builders/193

hello , this commit is breaking our buildbot : could you address please ? https://lab.llvm.org/buildbot/#/builders/193

It sounds like the frontend does not set the right alignment in IR:

Libomptarget message: explicit extension not allowed: host address specified is 0x00007fff01df78a0 (12 bytes), but device allocation maps to host at 0x00007fff01df78a4 (8 bytes)

@ronlieb, do you know who can help with this on OpenMP side?

asavonic reopened this revision.Dec 15 2022, 11:15 AM
This revision is now accepted and ready to land.Dec 15 2022, 11:15 AM

@jhuber6 please help with this one ? thanks

arsenm added a subscriber: arsenm.Dec 15 2022, 11:22 AM

Thanks, I've wanted to fix this for a long time. I have alignment decreasing optimizations I would like to perform

lkail added a subscriber: lkail.Dec 16 2022, 7:17 AM

I checked OpenMP tests, and the difference is pretty much what was expected. This is what we have as an input (the patch does not change the IR):

// x86_64-pc-linux-gnu :: mapping/ompx_hold/struct.c
struct S {
  int i;
  int j;
} s;
// CHECK: presence of s, s.i, s.j: 0, 0, 0
CHECK_PRESENCE(s, s.i, s.j);

IR:

%s = alloca %struct.S, align 4
%call = call i32 @omp_get_default_device()
%call1 = call i32 @omp_target_is_present(ptr noundef %s, i32 noundef %call)

The assembly is different: we keep alignment of 4 and do not increase it to the preferred value of 8.
Before the patch:

leaq    -16(%rbp), %rdi
callq   omp_target_is_present@PLT

After the patch:

leaq    -12(%rbp), %rdi
callq   omp_target_is_present@PLT

Libomptarget message: explicit extension not allowed: host address specified is 0x00007fff03109220 (12 bytes), but device allocation maps to host at 0x00007fff03109224 (8 bytes)

If I explicitly set __attribute__((aligned(8))) to the declaration of s, assembly before and after the patch becomes equivalent, so the problem should go away. Apparently there is some discrepancy in alignment that the frontend (or the OpenMP runtime) expects, but it is not obvious where we need to fix this.

jhuber6 added a subscriber: jdoerfert.EditedDec 19 2022, 1:45 PM

I checked OpenMP tests, and the difference is pretty much what was expected. This is what we have as an input (the patch does not change the IR):

// x86_64-pc-linux-gnu :: mapping/ompx_hold/struct.c
struct S {
  int i;
  int j;
} s;
// CHECK: presence of s, s.i, s.j: 0, 0, 0
CHECK_PRESENCE(s, s.i, s.j);

IR:

%s = alloca %struct.S, align 4
%call = call i32 @omp_get_default_device()
%call1 = call i32 @omp_target_is_present(ptr noundef %s, i32 noundef %call)

The assembly is different: we keep alignment of 4 and do not increase it to the preferred value of 8.
Before the patch:

leaq    -16(%rbp), %rdi
callq   omp_target_is_present@PLT

After the patch:

leaq    -12(%rbp), %rdi
callq   omp_target_is_present@PLT

Libomptarget message: explicit extension not allowed: host address specified is 0x00007fff03109220 (12 bytes), but device allocation maps to host at 0x00007fff03109224 (8 bytes)

If I explicitly set __attribute__((aligned(8))) to the declaration of s, assembly before and after the patch becomes equivalent, so the problem should go away. Apparently there is some discrepancy in alignment that the frontend (or the OpenMP runtime) expects, but it is not obvious where we need to fix this.

The error you get comes from the associated pointer we find extending before or after where we expect it to. My guess is that we think the device-side address is also four bytes, but is actually eight causing the discrepancy. Device in this case being the x86_64 code we generated for offloading, you should be able to see both using flags like here https://godbolt.org/z/EGzq7Tef4. I'm not sure why OpenMP seems to have a different preferred alignment. Total spitballing, but all the arguments to a target kernel in OpenMP are just treated as uintptr_t, might be always rounding up to 8 on the device side and this patch makes us not respect that anymore.

It appears that the issue is caused by libomptarget mapping code that adds padding if the begin address does not have alignment of 8 (search for "alignment" in omptarget.cpp). The problem is that the padding is not applied consistently, so we may have a mapping without the padding (size == 8), and then get a request for a mapping with the padding (size == 12). Notice that we don't have Using a padding of 4 bytes for begin address for the first mapping, and we have it for the same exact pointer for the second mapping.

check: parent DynRefCount=1 is not sufficient for transfer
Libomptarget --> Entering data begin region for device -1 with 1 mappings
[...]
Libomptarget --> Entry  0: Base=0x00007ffc9b2581e4, Begin=0x00007ffc9b2581e4, Size=8, Type=0x2003, Name=unknown
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc9b2581e4, Size=8)...
Libomptarget --> Creating new map entry with HstPtrBase=0x00007ffc9b2581e4, HstPtrBegin=0x00007ffc9b2581e4, TgtPtrBegin=0x0000561d0739b390, Size=8, DynRefCount=0, HoldRefCount=1, Name=unknown
Libomptarget --> Moving 8 bytes (hst:0x00007ffc9b2581e4) -> (tgt:0x0000561d0739b390)
[...]
Libomptarget --> Entering target region for device -1 with entry point 0x0000561d059ca298
[...]
Libomptarget --> Entry  0: Base=0x00007ffc9b2581e4, Begin=0x00007ffc9b2581e4, Size=8, Type=0x20, Name=unknown
Libomptarget --> Entry  1: Base=0x00007ffc9b2581e4, Begin=0x00007ffc9b2581e4, Size=4, Type=0x1000000000002, Name=unknown
Libomptarget --> Entry  2: Base=0x00007ffc9b2581e4, Begin=0x00007ffc9b2581e8, Size=4, Type=0x1000000000002, Name=unknown
Libomptarget --> loop trip count is 0.
Libomptarget --> Using a padding of 4 bytes for begin address 0x00007ffc9b2581e4
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007ffc9b2581e0, Size=12)...
Libomptarget --> WARNING: Pointer is not mapped but section extends into already mapped data
Libomptarget message: explicit extension not allowed: host address specified is 0x00007ffc9b2581e0 (12 bytes), but device allocation maps to host at 0x00007ffc9b2581e4 (8 bytes)

I'm not sure how to fix this issue. The obvious solution is to pass the padding value to DeviceTy::getTargetPointer, and ignore it if we already have a mapping without it (since alignment requirements are already satisfied, right?). Another option is to apply padding more consistently, but I don't know enough about OpenMP to figure out what exactly is wrong with the current implementation.

@grokos, @Hahnfeld, do you have any suggestions how to fix this issue? The handling of alignment was added in D44186.

Hi @asavonic, after reading through the comments here and peeking into D44186, I think your analysis is correct: libomptarget assumes that it can pad the start address of every member to 8 bytes and stay within the requested size, which isn't true anymore. I'm not sure if it would be possible to "fix up" this mistake by increasing the padding requirements in the IR generated by the Clang frontend (do we control all allocas and can they also be passed in by other parts of code?). I guess the better approach would indeed be to correct the implementation in the runtime library. That said, I haven't looked into the OpenMP offloading code for quite some time, more active members include @jdoerfert @jhuber6 @ronlieb (all of which have been pinged already...)

From the outside, and remembering many discussions and changes around alignment of struct members, a viable approach may be to disable the padding code in libomptarget and see which cases actually break. I wouldn't be super surprised if some / most of the complexity isn't actually needed anymore and is dealt with (more correctly) in the frontend.

lkail added a comment.Feb 7 2023, 11:01 PM

Is there any progress with this patch? This patch can fix our downstream test failure which is caused by insistent alignment between IR and backend, thus affects alias analysis.

Is there any progress with this patch? This patch can fix our downstream test failure which is caused by insistent alignment between IR and backend, thus affects alias analysis.

Yes, the regression in OpenMP was fixed, and I plan to land this patch today.

This revision was automatically updated to reflect the committed changes.

The patch is merged (again). Please let me know if it causes any regressions for in-tree targets or tests.

alexfh added a subscriber: alexfh.Feb 14 2023, 5:40 AM

The patch is merged (again). Please let me know if it causes any regressions for in-tree targets or tests.

An early heads up: we started seeing some test failures after this commit, but I can't yet tell whether the commit or the code is at fault.

The patch is merged (again). Please let me know if it causes any regressions for in-tree targets or tests.

An early heads up: we started seeing some test failures after this commit, but I can't yet tell whether the commit or the code is at fault.

The only data point so far is that failures are happening on AArch64, which is alignment-sensitive.

bgraur added a subscriber: bgraur.Feb 14 2023, 6:33 AM

The patch is merged (again). Please let me know if it causes any regressions for in-tree targets or tests.

An early heads up: we started seeing some test failures after this commit, but I can't yet tell whether the commit or the code is at fault.

The only data point so far is that failures are happening on AArch64, which is alignment-sensitive.

False alarm. All the problems we've seen so far are due to UB in the code.

Hello this commit is causing a compiler crash for the following example:
https://llvm.godbolt.org/z/xajYWoa8K

Hello this commit is causing a compiler crash for the following example:
https://llvm.godbolt.org/z/xajYWoa8K

That seems to generate MIR where there are adjacent scaled and unscaled 32-bit stores of wzr, but AArch64LoadStoreOpt::mergeNarrowZeroStores looks pretty broken if you mix scaled-ness, as it determines IsScaled based on just one of the instructions and then uses that to determine what the immediate means. Either those need to be skipped over or the function needs to learn how to handle that mix.

I've bisected a broken test in libcxx for the mingw/x86_64 target down to this commit - see https://github.com/llvm/llvm-project/issues/64253 for a full bug report including a reduced reproducer.

I've bisected a broken test in libcxx for the mingw/x86_64 target down to this commit - see https://github.com/llvm/llvm-project/issues/64253 for a full bug report including a reduced reproducer.

Most likely exposing an issue elsewhere. All this patch does is change the alignment of allocas, so most likely there's some unrelated problem related to or exposed by the stack layout. (Or maybe some other optimization is making bad assumptions about alloca alignment, but that's unlikely given the way the APIs in question work.) Probably using opt-bisect-limit to bisect the exact optimization in question would help narrow down the issue.

I've bisected a broken test in libcxx for the mingw/x86_64 target down to this commit - see https://github.com/llvm/llvm-project/issues/64253 for a full bug report including a reduced reproducer.

Most likely exposing an issue elsewhere. All this patch does is change the alignment of allocas, so most likely there's some unrelated problem related to or exposed by the stack layout. (Or maybe some other optimization is making bad assumptions about alloca alignment, but that's unlikely given the way the APIs in question work.) Probably using opt-bisect-limit to bisect the exact optimization in question would help narrow down the issue.

Thanks, I'll try to look into that and see if I can pinpoint anything. The surprising thing here though, is that the error occurs in unoptimized builds, while when building with optimizations enabled, the code turns out to work just right.