This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX][FIX] Expand atomics we cannot handle natively in the ISA
Needs ReviewPublic

Authored by jdoerfert on Dec 6 2019, 9:47 AM.

Details

Summary
NOTE: This is lacking a test and more of a request for feedback (I'm not an NVPTX person).

See also PR4219.

Diff Detail

Event Timeline

jdoerfert created this revision.Dec 6 2019, 9:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 6 2019, 9:47 AM
jfb added a reviewer: __simt__.Dec 6 2019, 9:54 AM
arsenm added a comment.Dec 6 2019, 9:57 AM

Needs tests. The AMDGPU ones in test/Transforms/AtomicExpand can probably be copied as-is (plus another codegen one to make sure AtomicExpand is actually running)

Build result: pass - 60562 tests passed, 0 failed and 726 were skipped.

Log files: console-log.txt, CMakeCache.txt

tra added a comment.Dec 6 2019, 10:05 AM

+1 for tests.

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
360
jdoerfert updated this revision to Diff 232633.Dec 6 2019, 12:53 PM
jdoerfert marked an inline comment as done.

Add tests and run pass in the pipeline

Fixed the typo, copied test/Transforms/AtomicExpand/AMDGPU into NVPTX and changed the run lines accordingly. Then I run the update_test_checks. The result is different than before (some expansion happens), and close to the AMDGPU result, but I haven't verified everything.

I also added the pass explicitly to the NVPTX required passes. Are there existing tests to check the target specific pipeline?

Build result: fail - 60568 tests passed, 3 failed and 726 were skipped.

failed: LLVM.CodeGen/NVPTX/atomics-sm60.ll
failed: LLVM.CodeGen/NVPTX/atomics.ll
failed: LLVM.CodeGen/NVPTX/load-store.ll

Log files: console-log.txt, CMakeCache.txt

tra added a comment.EditedDec 6 2019, 1:35 PM
> Command Output (stderr):
> --
> /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input
> ; CHECK: atom.add.f64
>          ^
> <stdin>:1:1: note: scanning from here
> //
> ^
> <stdin>:34:2: note: possible intended match here
>  atom.cas.b64 %rd3, [%r1], %rd2, %rd1;
>  ^

This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.

In D71128#1773447, @tra wrote:
> Command Output (stderr):
> --
> /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input
> ; CHECK: atom.add.f64
>          ^
> <stdin>:1:1: note: scanning from here
> //
> ^
> <stdin>:34:2: note: possible intended match here
>  atom.cas.b64 %rd3, [%r1], %rd2, %rd1;
>  ^

This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.

I'm already working on it.

jdoerfert updated this revision to Diff 232662.Dec 6 2019, 3:17 PM

Fix test cases by exposing more TLI hooks

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

Build result: FAILURE -
Log files: console-log.txt, CMakeCache.txt

I have the feeling that wasn't my fault.

tra added inline comments.Dec 9 2019, 11:57 AM
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
16

Don't we want to preserve atomicrmw fadd in this case and lower it to atom.add.f32 ? Why do we want to expand here?

131

Ditto here and below. We do have atom.add.f64

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
5

Functilon name fadd does not seem to match the instruction fsub.

15

I must be missing something -- I would think that we do *not* want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?

llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
2

Nit: no need for -check-prefix as you only using CHECK in the test.

jdoerfert marked 5 inline comments as done.Dec 9 2019, 1:33 PM

I will try to look into the problematic lowerings @tra pointed out (thanks btw!). Any hints to why they are expanded are appreciated :)

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fadd.ll
16

Same as below.

131

To be honest, I don't even know why we do not match it. All I (tried) to do is add the limits wrt. size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw but it seems something is missing here (maybe during the instruction "registration").

llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
5

Good catch, copy & paste from the AMD tests ;) (@arsenm

15

It is not intentional to pesimise anything, as mentioned above. The problem is I am neither a backend nor NVPTX person and my changes do seem to have unwanted effects I cannot even categorize.

llvm/test/Transforms/AtomicExpand/NVPTX/unaligned-atomic.ll
2

Fair, I think I copied this ;)

arsenm added inline comments.Dec 10 2019, 8:20 AM
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
15

For the purpose of this change, that this isn't optimal doesn't matter. These aren't implemented correct, but doing so is a separate change and those changes will show up in the same tests here

tra added inline comments.Dec 10 2019, 9:05 AM
llvm/test/Transforms/AtomicExpand/NVPTX/expand-atomic-rmw-fsub.ll
15

OK. Looks like atomicrmw fsub currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding atomicrmw fadd is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels so anything that prevents mapping it to atom.add.f32 instruction will be very noticeable.

I realize that there are many moving parts involved in getting this to work properly. If proper fix needs multiple patches, please try to commit them atomically to avoid the performance regression in between those changes.

Also, if there are dependent patches, it would be great to arrange all of them as such in phabricator, so it's easier to see the big picture.

arsenm added inline comments.Jan 9 2020, 7:15 AM
llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871

Do we really need 4 of these when just the one for Instruction will work

jdoerfert marked an inline comment as done.Jan 9 2020, 8:32 AM
jdoerfert added inline comments.
llvm/include/llvm/CodeGen/TargetLowering.h
1856–1871

Alternatively we can overload the instruction one and check for the kind to decide what to do. I don't remember how I ended up like this, I'll address this once I get around to this patch again...

arsenm resigned from this revision.Feb 13 2020, 4:44 PM