This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement hardware bug workaround for image instructions
ClosedPublic

Authored by rdomingu on Jun 4 2020, 8:57 AM.

Details

Summary

[AMDGPU] Implement hardware bug workaround for image instructions

This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.

Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899

Diff Detail

Unit TestsFailed

Event Timeline

rdomingu created this revision.Jun 4 2020, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2020, 8:57 AM
rdomingu edited the summary of this revision. (Show Details)Jun 4 2020, 10:42 AM
rdomingu added reviewers: arsenm, rampitec.
arsenm added a comment.Jun 4 2020, 1:49 PM

Also needs to update the GlobalISel path

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5932–5933

IsD16 && (!hasUnpackedD16() || (Gather4 && hasGatherbug))?

7309

You can initialize this to the target size and then avoid push_back

7310

This is the same as just MVT::v2i16

7312

Formatting

rdomingu marked 3 inline comments as done.Jun 5 2020, 6:15 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5932–5933

We want NumVDataDwords = DMaskLanes for Gather4 && IsD16 && hasGatherBug. That wouldn't be the case with your suggestion.

rdomingu updated this revision to Diff 276533.Jul 8 2020, 1:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Jul 9 2020, 4:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3279–3280

Brace formatting

3284–3286

There's no obstacle to handling v3 here, it should work in the other cases

rdomingu marked 2 inline comments as done.Jul 13 2020, 10:51 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3284–3286

Note that v3f16 is not handled in SIISelLowering.cpp either (line 7255). Are you suggesting we implement v3f16 there too?

arsenm added inline comments.Jul 13 2020, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3284–3286

There is a v3f16 definition now, so you could. I think it's more important for globalisel to be more complete

rdomingu updated this revision to Diff 281734.Jul 29 2020, 2:02 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 281970.Jul 30 2020, 10:21 AM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 282042.Jul 30 2020, 2:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 4 2020, 7:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3286

No else after return

3301

No else after return

3303

llvm_unreachable instead of assert(alse)

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
197

I know clang-format really wants to pack these onto a single line, but it's a terrible idea and you shouldn't listen to it

rdomingu updated this revision to Diff 284443.Aug 10 2020, 10:48 AM
rdomingu marked 4 inline comments as done.

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 28 2020, 11:27 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3275

Same as above

3289

If you just resize to 8 below, why not use an array?

3290–3294

It would be preferable to emit a concat_vectors of <2 x s16> pieces here

3293

Resize here is weird. You can push back, or constructed PackedRegs to the desired size?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7310

++I

rdomingu added inline comments.Sep 10 2020, 6:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3290–3294

Sorry, I'm new to this. Why would concat_vectors be preferable than build_vector? Could you please elaborate?

arsenm added inline comments.Sep 16 2020, 7:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3290–3294

Because a G_BUILD_VECTOR with 16-bit sources isn't naturally legal. This works, it just adds more work for the legalizer to reprocess these when you could produce something that's legal to begin with to save compile time

rdomingu added inline comments.Sep 16 2020, 8:07 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3290–3294

I see. But how would you go from v3f16 to concat_vectors of <2 x 16> to v4f32 (which is what we want at the end)?

arsenm added inline comments.Sep 16 2020, 8:16 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3290–3294

I think I'm missing something. Why is this going from <3 x s16> to <4 x s32>? Isn't this the unpacked layout case? Why isn't this just an G_ANYEXT from <3 x s16> to <3 x s32>?

3293

You can do this in the initial construction, also the small size should just be 8? Or use std::array?

rdomingu added inline comments.Sep 29 2020, 7:09 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3290–3294

This is the image workaround for the packed layout case. We don't want to change the data layout which is why we shouldn't use G_ANYEXT. We just want to make the compiler think the data is twice as big.

arsenm accepted this revision.Sep 30 2020, 7:43 AM
This revision is now accepted and ready to land.Sep 30 2020, 7:43 AM
This revision was landed with ongoing or failed builds.Oct 7 2020, 5:04 AM
This revision was automatically updated to reflect the committed changes.