See also PR4219.
Fixed the typo, copied test/Transforms/AtomicExpand/AMDGPU into NVPTX and changed the run lines accordingly. Then I run the update_test_checks. The result is different than before (some expansion happens), and close to the AMDGPU result, but I haven't verified everything.
I also added the pass explicitly to the NVPTX required passes. Are there existing tests to check the target specific pipeline?
> Command Output (stderr): > -- > /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang8/llvm/test/CodeGen/NVPTX/atomics-sm60.ll:6:10: error: CHECK: expected string not found in input > ; CHECK: atom.add.f64 > ^ > <stdin>:1:1: note: scanning from here > // > ^ > <stdin>:34:2: note: possible intended match here > atom.cas.b64 %rd3, [%r1], %rd2, %rd1; > ^
This appears to be a regression. We do have fp32/fp64 atomic adds in NVPTX. Replacing them with add+CAS is suboptimal.
Don't we want to preserve atomicrmw fadd in this case and lower it to atom.add.f32 ? Why do we want to expand here?
Ditto here and below. We do have atom.add.f64
Functilon name fadd does not seem to match the instruction fsub.
I must be missing something -- I would think that we do *not* want to expand atomicrmw variants which we can lower to an existing instruction, but a lot of the tests show the opposite and expand atomics that have direct support in hardware. The patch subject seems to agree with my assumptions, but the tests appear to contradict it. Is that intentional? If so, what is it that I'm missing?
Nit: no need for -check-prefix as you only using CHECK in the test.
I will try to look into the problematic lowerings @tra pointed out (thanks btw!). Any hints to why they are expanded are appreciated :)
Same as below.
To be honest, I don't even know why we do not match it. All I (tried) to do is add the limits wrt. size and alignment. Somehow that had more effect than I wanted. The new hooks already remove some of the weirdness we saw but it seems something is missing here (maybe during the instruction "registration").
Good catch, copy & paste from the AMD tests ;) (@arsenm
It is not intentional to pesimise anything, as mentioned above. The problem is I am neither a backend nor NVPTX person and my changes do seem to have unwanted effects I cannot even categorize.
Fair, I think I copied this ;)
For the purpose of this change, that this isn't optimal doesn't matter. These aren't implemented correct, but doing so is a separate change and those changes will show up in the same tests here
OK. Looks like atomicrmw fsub currently fails to lower on NVPTX, so expanding it is an improvement. However, expanding atomicrmw fadd is a substantial regression and is likely to be a showstopper. Atomic FP32 addition is a commonly used instruction in various reduction kernels so anything that prevents mapping it to atom.add.f32 instruction will be very noticeable.
I realize that there are many moving parts involved in getting this to work properly. If proper fix needs multiple patches, please try to commit them atomically to avoid the performance regression in between those changes.
Also, if there are dependent patches, it would be great to arrange all of them as such in phabricator, so it's easier to see the big picture.
Alternatively we can overload the instruction one and check for the kind to decide what to do. I don't remember how I ended up like this, I'll address this once I get around to this patch again...