This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement hardware bug workaround for image instructions
ClosedPublic

Authored by rdomingu on Jun 4 2020, 8:57 AM.

Details

Summary

[AMDGPU] Implement hardware bug workaround for image instructions

This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.

Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899

Diff Detail

Event Timeline

rdomingu created this revision.Jun 4 2020, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2020, 8:57 AM
rdomingu edited the summary of this revision. (Show Details)Jun 4 2020, 10:42 AM
rdomingu added reviewers: arsenm, rampitec.
arsenm added a comment.Jun 4 2020, 1:49 PM

Also needs to update the GlobalISel path

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5370–5371

IsD16 && (!hasUnpackedD16() || (Gather4 && hasGatherbug))?

6700

You can initialize this to the target size and then avoid push_back

6701

This is the same as just MVT::v2i16

6703

Formatting

rdomingu marked 3 inline comments as done.Jun 5 2020, 6:15 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5370–5371

We want NumVDataDwords = DMaskLanes for Gather4 && IsD16 && hasGatherBug. That wouldn't be the case with your suggestion.

rdomingu updated this revision to Diff 276533.Jul 8 2020, 1:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Jul 9 2020, 4:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3221–3222 ↗(On Diff #276533)

Brace formatting

3226–3228 ↗(On Diff #276533)

There's no obstacle to handling v3 here, it should work in the other cases

rdomingu marked 2 inline comments as done.Jul 13 2020, 10:51 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3226–3228 ↗(On Diff #276533)

Note that v3f16 is not handled in SIISelLowering.cpp either (line 7255). Are you suggesting we implement v3f16 there too?

arsenm added inline comments.Jul 13 2020, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3226–3228 ↗(On Diff #276533)

There is a v3f16 definition now, so you could. I think it's more important for globalisel to be more complete

rdomingu updated this revision to Diff 281734.Jul 29 2020, 2:02 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 281970.Jul 30 2020, 10:21 AM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 282042.Jul 30 2020, 2:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 4 2020, 7:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3286 ↗(On Diff #282042)

No else after return

3301 ↗(On Diff #282042)

No else after return

3303 ↗(On Diff #282042)

llvm_unreachable instead of assert(alse)

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
317

I know clang-format really wants to pack these onto a single line, but it's a terrible idea and you shouldn't listen to it

rdomingu updated this revision to Diff 284443.Aug 10 2020, 10:48 AM
rdomingu marked 4 inline comments as done.

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 28 2020, 11:27 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3410 ↗(On Diff #284443)

If you just resize to 8 below, why not use an array?

3411–3415 ↗(On Diff #284443)

It would be preferable to emit a concat_vectors of <2 x s16> pieces here

3414 ↗(On Diff #284443)

Resize here is weird. You can push back, or constructed PackedRegs to the desired size?

3420 ↗(On Diff #284443)

Same as above

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6701

++I

rdomingu added inline comments.Sep 10 2020, 6:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3411–3415 ↗(On Diff #284443)

Sorry, I'm new to this. Why would concat_vectors be preferable than build_vector? Could you please elaborate?

arsenm added inline comments.Sep 16 2020, 7:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3411–3415 ↗(On Diff #284443)

Because a G_BUILD_VECTOR with 16-bit sources isn't naturally legal. This works, it just adds more work for the legalizer to reprocess these when you could produce something that's legal to begin with to save compile time

rdomingu added inline comments.Sep 16 2020, 8:07 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3411–3415 ↗(On Diff #284443)

I see. But how would you go from v3f16 to concat_vectors of <2 x 16> to v4f32 (which is what we want at the end)?

arsenm added inline comments.Sep 16 2020, 8:16 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3411–3415 ↗(On Diff #284443)

I think I'm missing something. Why is this going from <3 x s16> to <4 x s32>? Isn't this the unpacked layout case? Why isn't this just an G_ANYEXT from <3 x s16> to <3 x s32>?

3414 ↗(On Diff #284443)

You can do this in the initial construction, also the small size should just be 8? Or use std::array?

rdomingu added inline comments.Sep 29 2020, 7:09 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3411–3415 ↗(On Diff #284443)

This is the image workaround for the packed layout case. We don't want to change the data layout which is why we shouldn't use G_ANYEXT. We just want to make the compiler think the data is twice as big.

arsenm accepted this revision.Sep 30 2020, 7:43 AM
This revision is now accepted and ready to land.Sep 30 2020, 7:43 AM
This revision was landed with ongoing or failed builds.Oct 7 2020, 5:04 AM
This revision was automatically updated to reflect the committed changes.