Page MenuHomePhabricator

[openmp] Fix 51647, corrupt bitcode on amdgpu
ClosedPublic

Authored by JonChesterfield on Sep 9 2021, 4:17 AM.

Details

Summary

Patch by @dpalermo

The corrupt bitcode reported in https://bugs.llvm.org/show_bug.cgi?id=51647 seems to be a result of a later pass changing the workfn variable to addrspace(5) (thread private, on the stack). That seems reasonable for an alloca without an address space so it's an open question why that can crash the bitcode reader.

This change puts it in the thread private address space to begin with which means whatever misfired further down the pipeline does not break it. That matches the codegen from clang where stack variables are always annotated (5) and then addrspace cast prior to following use.

This therefore patches around whatever unsuccessfully moved the alloca variable to addrspace(5). That solves the problem of openmp opt producing code that crashes the bitcode reader. It should be possible to create a minimal repro for the underlying bug based on some handwritten IR that uses an alloca in a generic address space.

Diff Detail

Event Timeline

JonChesterfield created this revision.Sep 9 2021, 4:17 AM
JonChesterfield requested review of this revision.Sep 9 2021, 4:17 AM
Herald added a project: Restricted Project. · View Herald Transcript
JonChesterfield edited the summary of this revision. (Show Details)Sep 9 2021, 4:38 AM
JonChesterfield edited the summary of this revision. (Show Details)
JonChesterfield retitled this revision from [openmp] Provisional fix for 51647, not yet tested on nvptx to [openmp] Fix 51647, corrupt bitcode on amdgpu.Sep 9 2021, 5:01 AM
  • format, rename variable

Built & run on gfx9/gfx10/sm_50/sm_70 which suggests this hasn't broken anything. Haven't seen the corrupt bitcode since, though can't say with certainty that is fixed by this patch. @dpalermo would you like to commandeer this diff? Think I've got the grunt work out of the way.

dpalermo-phab accepted this revision.Sep 9 2021, 6:30 AM
dpalermo-phab added a subscriber: dpalermo-phab.

LGTM

This revision is now accepted and ready to land.Sep 9 2021, 6:30 AM
ronlieb accepted this revision.Sep 9 2021, 6:32 AM

LGTM

This one has been deeply annoying to debug and turned up in some customer bug reports so we'd really like to add it to the llvm-13 release branch.

Remove the target triple from
llvm/test/Transforms/OpenMP/custom_state_machines.ll
and run it once with --mtriple set to amd and once to ptx.
The check prefixes should be --check-prefixes=ALL,AMDGPU and --check-prefixes=ALL,NVPTX respectively

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
3466

not address space local. DataLayout::getAllocaAddrSpace

3491

if (WorkFnAI->getType()->getAS() != Generic) WorkFnAI = new ASCast...

@dpalermo is taking a look at the above comment by @jdoerfert

dpalermo commandeered this revision.Sep 9 2021, 9:06 AM
dpalermo edited reviewers, added: JonChesterfield; removed: dpalermo.
dpalermo updated this revision to Diff 371686.Sep 9 2021, 12:11 PM
  • Updates to code per review (will update test next)
dpalermo marked 2 inline comments as done.Sep 9 2021, 12:14 PM

The whitespace looks a bit off to me but the phab clang-format bot is apparently down. Please git-clang-format HEAD^ the diff

dpalermo updated this revision to Diff 371701.Sep 9 2021, 1:05 PM
  • ran: git-clang-format HEAD^
dpalermo updated this revision to Diff 371729.Sep 9 2021, 3:03 PM
  • update test to pass -mtriple and run for both amdgcn & nvptx
dpalermo marked an inline comment as not done.Sep 9 2021, 3:04 PM

While updating the lit test as requested (add -mtriple and run for both amdgcn & nvptx), I found that the alloca addrspace(5) and cast to generic was no longer appearing. I backed out the change to use DataLayout::getAllocaAddrSpace...and then the modified lit test started passing. Now investigating what is happening when using DataLayout::getAllocaAddrSpace.

Potentially interesting that getAllocaAddrSpace returns a different value to local (as that's presumably why they're behaving differently)

Potentially interesting that getAllocaAddrSpace returns a different value to local (as that's presumably why they're behaving differently)

Exactly

Also, as requested, I tried adding --check-prefixes=ALL,AMDGPU and --check-prefixes=ALL,NVPTX to the FileCheck in the test, but that caused lit to report an error that the prefixes weren't found:

$ build/bin/llvm-lit -v llvm/test/Transforms/OpenMP/custom_state_machines.ll
...
error: no check strings found with prefixes 'ALL:', 'AMDGPU:'
error: no check strings found with prefixes 'ALL:', 'NVPTX:'

So I didn't include that change here.

The first version of this fixes a few errors in OvO for me.

I think the filecheck suggestion was to allow matching different code for amdgpu as opposed to nvptx. So it's only meaningful if there are some CHECK lines that are different for each, where CHECK: has been replaced with a different string

dpalermo updated this revision to Diff 371757.Sep 9 2021, 5:48 PM
  • reenable DataLayout::getAllocaAddrSpace
dpalermo marked an inline comment as done.Sep 9 2021, 5:48 PM

I've reenabled the use of DataLayout::getAllocaAddrSpace as that does indeed still work as intended in the larger runnable test cases that hit this problem...and is the right thing to do since not all target architectures would necessarily have alloca in the local address space.

This means the lit test is once again broken. I've determined that the problem with the lit test is that:

  • there is no datalayout specified so we get a default that doesn't put alloca in the local address space (so it doesn't add the cast)...and the test "fails"
  • I tried just adding a datalayout, but statements end up moving around causing all sorts of differences (checking is currently very tight)....and the test fails

So the options I can think of are:

  1. just update the lit test without the addrcast so it passes, but it isn't testing what is being done in this change
  2. regenerate the test case & checkers using a datalayout that has alloca as local
  3. or do #1 and add another test that exercises this change (e.g. using the reduced test from the JIRA)

Let me know what your preference is.

JonChesterfield commandeered this revision.Sep 10 2021, 3:33 AM
JonChesterfield edited reviewers, added: dpalermo; removed: JonChesterfield.

Dan tells me he is no longer available to work on this, at least for the near future.

Datalayout seems necessary if we're going to key codegen off it. Amdgpu and nvptx use different datalayouts and I'm reluctant to duplicate the test so will try adding a simple datalayout that makes adequate sense for either, regenerate the tests based off that as one diff, then recreate this patch on top.

I may not be able to get to this today.

Datalayout seems necessary if we're going to key codegen off it. Amdgpu and nvptx use different datalayouts and I'm reluctant to duplicate the test so will try adding a simple datalayout that makes adequate sense for either, regenerate the tests based off that as one diff, then recreate this patch on top.

You can specify the target and the datalayout in the run command, no need to duplicate things: --data-layout=

  • Set check-prefixes on custom_state_machines, run update_test_checks, advice sought on making that work
llvm/test/Transforms/OpenMP/custom_state_machines.ll
1–2

Changed the opt invocations here. Now have one opt for mtriple=amdgpu and one for nvptx, with CHECK-PREFIXES set.

Then ran (exactly):
cd $HOME/llvm-project/llvm
./utils/update_test_checks.py test/Transforms/OpenMP/custom_state_machines.ll --opt-binary=$HOME/llvm-install/bin/opt

That rewrite custom_state_machines to what is uploaded here, which seems unlikely to be what is desired. It looks like a complete copy of the input duplicated for each permutation of check-prefixes, plus all the CHECK-NEXT from before which is now dead.

What am I supposed to do to update this test?

  • manually removed check lines and reran script
  • custom state machines test now looks right
  • spmdization now good
jdoerfert accepted this revision.Sep 13 2021, 7:03 AM

LG, assuming the tests pass

  • Drop triple statements

Yep, tests passing. I'm happy with this too. Will land it shortly

This revision was landed with ongoing or failed builds.Sep 13 2021, 7:25 AM
This revision was automatically updated to reflect the committed changes.

Note to self about the test updating process.

The utils/update_test_checks.py script reads the ; RUN commands and emits new check statements based on the --check-prefixes string. When changing the prefix it doesn't remove the old ones, but the test format is predictable enough that grep -v '; CHECK' does the right thing.

Process for updating tests was to duplicate the run line, giving different triple and prefix for amdgpu and nvptx. Also added --data-layout=A5 for alloca in addrspace 5 for amdgpu, copied from some other tests that don't set the whole datalayout. Then delete all the existing CHECK prefixes. Then run ./utils/update_test_checks.py test/Transforms/OpenMP/spmdization.ll --opt-binary=$HOME/llvm-install/bin/opt as that's where I install opt locally.

JonChesterfield added a comment.EditedSep 13 2021, 8:48 AM

Having trouble applying this to llvm-13 release, 'opt: Unknown command line argument '-openmp-opt-disable-state-machine-rewrite'

@jdoerfert do we need to apply some state machine changes to llvm-13?

diff vs 13 looks like

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature --check-attributes --include-generated-funcs
-; RUN: opt -S -passes=openmp-opt < %s | FileCheck %s
+; RUN: opt --mtriple=amdgcn-amd-amdhsa --data-layout=A5 -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=AMDGPU
+; RUN: opt --mtriple=nvptx64--         -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=NVPTX
+; RUN: opt --mtriple=amdgcn-amd-amdhsa --data-layout=A5 -openmp-opt-disable-state-machine-rewrite -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=AMDGPU-DISABLED
+; RUN: opt --mtriple=nvptx64--         -openmp-opt-disable-state-machine-rewrite -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=NVPTX-DISABLED

so I expect I can apply the same change manually, without the --disabled test lines

Having trouble applying this to llvm-13 release, 'opt: Unknown command line argument '-openmp-opt-disable-state-machine-rewrite'

@jdoerfert do we need to apply some state machine changes to llvm-13?

diff vs 13 looks like

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --function-signature --check-attributes --include-generated-funcs
-; RUN: opt -S -passes=openmp-opt < %s | FileCheck %s
+; RUN: opt --mtriple=amdgcn-amd-amdhsa --data-layout=A5 -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=AMDGPU
+; RUN: opt --mtriple=nvptx64--         -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=NVPTX
+; RUN: opt --mtriple=amdgcn-amd-amdhsa --data-layout=A5 -openmp-opt-disable-state-machine-rewrite -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=AMDGPU-DISABLED
+; RUN: opt --mtriple=nvptx64--         -openmp-opt-disable-state-machine-rewrite -S -passes=openmp-opt < %s | FileCheck %s --check-prefixes=NVPTX-DISABLED

so I expect I can apply the same change manually, without the --disabled test lines

Yes, remove the run lines for disable for back porting.

Worked around by splitting the patch to trunk into code (71052ea1e3c63b7209731fdc1726d10640d97480) and test commits (6775ad2025fc74c76fc440efb1de98de2179b6bc) and posting bug https://bugs.llvm.org/show_bug.cgi?id=51838 to apply the functional change as that applies cleanly. Hopefully I (or @dpalermo?) will be able to create a patch or branch for the manually patched up test change against llvm-13.