This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Disable bool range metadata to workaround backend issue
ClosedPublic

Authored by yaxunl on Oct 5 2022, 7:55 AM.

Details

Summary

Currently there is a middle-end or backend issue
https://github.com/llvm/llvm-project/issues/58176
which causes values loaded from bool pointer incorrect when
bool range metadata is emitted. Temporarily
disable bool range metadata until the backend issue
is fixed.

Diff Detail

Event Timeline

yaxunl created this revision.Oct 5 2022, 7:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 5 2022, 7:55 AM
Herald added subscribers: kosarev, t-tye, tpr and 2 others. · View Herald Transcript
yaxunl requested review of this revision.Oct 5 2022, 7:55 AM
yaxunl updated this revision to Diff 465403.Oct 5 2022, 8:02 AM

fix comments in test

tra added a comment.Oct 5 2022, 10:18 AM

Is there more info about the issue? What does AMDGPU currently emit for the test case?

AFAICT from running it on CE (https://godbolt.org/z/ccq3vnbrM) llvm optimizes it to essentially *y = *x and generates a 1-byte load+store for both NVPTX and AMDGPU.

yaxunl added a comment.Oct 5 2022, 3:40 PM

Is there more info about the issue? What does AMDGPU currently emit for the test case?

AFAICT from running it on CE (https://godbolt.org/z/ccq3vnbrM) llvm optimizes it to essentially *y = *x and generates a 1-byte load+store for both NVPTX and AMDGPU.

The issue happens to more complicated test cases which I cannot reduce right now.

Basically 8018d6be3459780e81a5da128a9915eb27909902 caused regressions in some PyTorch tests. Investigation shows the propagation of range metadata for bool type triggered some optimizations which caused some bool values to be loaded incorrectly. I will continue investigating the issue. However, I need a workaround for now.

tra added inline comments.Oct 5 2022, 3:47 PM
clang/lib/CodeGen/CGExpr.cpp
1751–1754

It would be great to open a github issue, if we don't have one yet, and reference it here, so we can tell later what exactly it is we're working around here and know for sure when/whether we can undo the change.

clang/test/CodeGenCUDA/bool-range.cu
15

Ditto, a bug reference would help.

yaxunl marked an inline comment as done.Oct 5 2022, 5:37 PM
yaxunl added inline comments.
clang/lib/CodeGen/CGExpr.cpp
1751–1754

opened github issue:

https://github.com/llvm/llvm-project/issues/58176

will update the comments

yaxunl updated this revision to Diff 465731.Oct 6 2022, 7:41 AM
yaxunl marked an inline comment as done.
yaxunl edited the summary of this revision. (Show Details)

update comments with issue link

tra accepted this revision.Oct 6 2022, 10:43 AM
This revision is now accepted and ready to land.Oct 6 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2022, 7:46 AM

Checking back here, have you made any progress on reducing the issue?

cc @arsenm for awareness

Checking back here, have you made any progress on reducing the issue?

cc @arsenm for awareness

No. I am busy with other work and have not got time to get back on it.

nikic added a comment.Dec 8 2022, 7:58 AM

Checking back here again on whether there is any progress on finding the root cause of the issue. If no progress is expected in the near future, I'd ask for this patch to be reverted.

yaxunl added a subscriber: jrbyrnes.Dec 8 2022, 9:22 AM

Checking back here again on whether there is any progress on finding the root cause of the issue. If no progress is expected in the near future, I'd ask for this patch to be reverted.

@jrbyrnes is working on the root cause of this issue. Any updates? Thanks.

Checking back here again on whether there is any progress on finding the root cause of the issue. If no progress is expected in the near future, I'd ask for this patch to be reverted.

@jrbyrnes is working on the root cause of this issue. Any updates? Thanks.

Thanks for the ping. I would also like to see this reverted as it enables some optimizations. I do not have a definitive answer at the moment (w.r.t reverting this), but hope to provide one soon

As for now, the issue we are seeing from (https://github.com/llvm/llvm-project/commit/8018d6be3459780e81a5da128a9915eb27909902) seems most likely to be a source code issue (first document of issue https://github.com/pytorch/pytorch/issues/54789 . upstream PyTorch currently skips problematic test https://github.com/pytorch/pytorch/blob/b738da8c8e4d9142ad38a1bd8c35d0bfef4b5e3c/torch/testing/_internal/common_methods_invocations.py#L14891) . I will provide a better update soon.

Checking back here again on whether there is any progress on finding the root cause of the issue. If no progress is expected in the near future, I'd ask for this patch to be reverted.

@jrbyrnes is working on the root cause of this issue. Any updates? Thanks.

Thanks for the ping. I would also like to see this reverted as it enables some optimizations. I do not have a definitive answer at the moment (w.r.t reverting this), but hope to provide one soon

As for now, the issue we are seeing from (https://github.com/llvm/llvm-project/commit/8018d6be3459780e81a5da128a9915eb27909902) seems most likely to be a source code issue (first document of issue https://github.com/pytorch/pytorch/issues/54789 . upstream PyTorch currently skips problematic test https://github.com/pytorch/pytorch/blob/b738da8c8e4d9142ad38a1bd8c35d0bfef4b5e3c/torch/testing/_internal/common_methods_invocations.py#L14891) . I will provide a better update soon.

I will revert this patch. Thanks.