Page MenuHomePhabricator


Authored by tstellarAMD on Feb 15 2016, 3:10 PM.



Implement BUFFER_ATOMIC_CMPSWAP{,_X2} instructions on all GCN targets, and FLAT_ATOMIC_CMPSWAP{,_X2} on CI+.

32-bit instruction variants tested manually on Kabini and Bonaire. Tests and parts of code provided by Jan Veselý.

Diff Detail


Event Timeline

rivanvx updated this revision to Diff 48018.Feb 15 2016, 3:10 PM
rivanvx retitled this revision from to Implement {BUFFER,FLAT}_ATOMIC_CMPSWAP{,_X2}.
rivanvx updated this object.
rivanvx added reviewers: nhaehnle, arsenm.

Please ignore changes in lib/Target/AMDGPU/, that's unrelated and was uploaded by accident. Will fix on reupload.

arsenm added inline comments.Feb 15 2016, 3:17 PM
191–194 ↗(On Diff #48018)

You should only need the one node. You need to remove the hardcoded vector type. What you need is SDTCisEltOfVec for the type constraint

394 ↗(On Diff #48018)

Not necessary, the type applied to the pattern is what matters

1990 ↗(On Diff #48018)

Space after if, and not before (AS)

2004–2012 ↗(On Diff #48018)

You can put VT directly into getVTList, you don't need the if

56–67 ↗(On Diff #48018)

I think this is exposing too much. the glc/slc bit behavior is what controls the atomic return behavior, so it makes no sense to expose. tfe requires changing the register class of the result, so it too should not be exposed.

The other addressing mode fields should also not be directly exposed.

67 ↗(On Diff #48018)

This definitely is not true

69 ↗(On Diff #48018)

New intrinsics should use the amdgcn prefix, not SI

arsenm added inline comments.Feb 15 2016, 3:20 PM
2004–2012 ↗(On Diff #48018)

You don't even need to construct a new vtlist, it should be the same as the incoming op's

arsenm edited edge metadata.Feb 15 2016, 4:38 PM

You could also try adding this to the set of handled memory nodes in performDAGCombine. This as is won't get that combine since currently it only runs on the legal dag, so the LDS atomic_cmp_swap won't appear anymore

rivanvx marked 5 inline comments as done.Feb 16 2016, 6:39 AM

Fixed all comments except intrinsics, will upload a new revision with tests and without intrinsics chages when I have tests ready.

rivanvx retitled this revision from Implement {BUFFER,FLAT}_ATOMIC_CMPSWAP{,_X2} to AMDGPU: Implement {BUFFER,FLAT}_ATOMIC_CMPSWAP{,_X2}.Feb 17 2016, 4:42 AM
rivanvx edited edge metadata.

Hi, any update?

Yes, I will post the updated patch per arsenm's suggestions tomorrow and one with tests over the weekend. I didn't find time to write the tests, sorry.

rivanvx updated this revision to Diff 49110.Feb 25 2016, 1:12 PM

Posting updated version.

arsenm added inline comments.Feb 25 2016, 1:26 PM
392 ↗(On Diff #49110)

You should only need this one

2003 ↗(On Diff #49110)

The wrapping would be less ugly if the type were set to a variable first

rivanvx updated this revision to Diff 50370.Mar 10 2016, 3:36 PM
rivanvx marked an inline comment as done.

Updated per comments by @arsenm.

nhaehnle requested changes to this revision.Mar 10 2016, 5:11 PM
nhaehnle edited edge metadata.
nhaehnle added inline comments.
2823–2824 ↗(On Diff #50370)

You removed the constraint $vdata = $vdata_in, and I don't see anything added to enforce it in a different way.

As far as I know, there is currently no way to specify that one operand must be a subregister of a different operand, and implementing such a feature would be quite a bit of work.

For image atomic cmpswap, I worked around this by saying that cmpswap returns a VReg_64, and using an EXTRACT_SUBREG in the pattern.

This revision now requires changes to proceed.Mar 10 2016, 5:11 PM
scchan added a subscriber: scchan.Mar 18 2016, 2:21 PM


I worked on a similar patch. The tests probably need adapting for your solution, but feel free to reuse any part that you find useful: (mime: text/x-patch)


jvesely added inline comments.Mar 28 2016, 12:28 PM
56–67 ↗(On Diff #48018)

the glc/slc bit behavior is what controls the atomic return behavior,

are you sure about this? according to the specs GLC controls return/noreturn (since atomics are by def. globally coherent.)
but the SLC bit controls whether the op is system level coherent which is IMO needed for HSA targets.

arsenm added inline comments.Mar 28 2016, 1:10 PM
56–67 ↗(On Diff #48018)

slc should be OK. TFE adds another change to the return type, so that is not ok

rivanvx updated this revision to Diff 51926.Mar 29 2016, 8:45 AM
rivanvx edited edge metadata.

Merged Jan's approach. It would be nice if this works, because it's a much smaller diff.

Currently fails with: llc: workspace/llvm/include/llvm/CodeGen/ValueTypes.h:249: unsigned int llvm::EVT::getVectorNumElements() const: Assertion `isVector() && "Invalid vector type!"' failed. I have to figure out why, will post an updated one later today.

rivanvx updated this revision to Diff 52005.EditedMar 29 2016, 4:45 PM
rivanvx updated this object.
rivanvx edited edge metadata.

Works now. Since there were no changes to MUBUF_Atomic multiclass, this addresses the concerns by @nhaehnle.

Existing tests with flat instructions need to be updated since now we respect the constraint output = input. Before doing that, I would like to make sure this is correct, especially LowerATOMIC_CMP_SWAP().

tstellarAMD commandeered this revision.Mar 31 2016, 1:55 PM
tstellarAMD edited reviewers, added: rivanvx; removed: tstellarAMD.

I had to make some changes to this patch in order to avoid breaking
the assembler. The main change is that the custom SDNode we use
for representing CMP_SWAP now returns a scalar value rather than
a vector value, which matches how the flat instructions are defined.

As a result of this change, I also had to add stand-alone patterns for
the mubuf atomics.

rivanvx accepted this revision.Mar 31 2016, 6:08 PM
rivanvx edited edge metadata.

LGTM and 32-bit version of the instruction tested on Tonga, works.

arsenm added inline comments.Apr 1 2016, 12:55 AM
2019 ↗(On Diff #52283)

The assert on the nose is redundant with the cast

This revision was automatically updated to reflect the committed changes.

LGTM, just one inline comment.


This should probably include the address space check of global_binary_atomic_op as well, right?

On a second look, the _nortn variant seems to be unused.