This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement hardware bug workaround for image instructions
ClosedPublic

Authored by rdomingu on Jun 4 2020, 8:57 AM.

Details

Summary

[AMDGPU] Implement hardware bug workaround for image instructions

This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.

Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899

Diff Detail

Event Timeline

rdomingu created this revision.Jun 4 2020, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2020, 8:57 AM
rdomingu edited the summary of this revision. (Show Details)Jun 4 2020, 10:42 AM
rdomingu added reviewers: arsenm, rampitec.
arsenm added a comment.Jun 4 2020, 1:49 PM

Also needs to update the GlobalISel path

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5890–5891

IsD16 && (!hasUnpackedD16() || (Gather4 && hasGatherbug))?

7280

You can initialize this to the target size and then avoid push_back

7281

This is the same as just MVT::v2i16

7283

Formatting

rdomingu marked 3 inline comments as done.Jun 5 2020, 6:15 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5890–5891

We want NumVDataDwords = DMaskLanes for Gather4 && IsD16 && hasGatherBug. That wouldn't be the case with your suggestion.

rdomingu updated this revision to Diff 276533.Jul 8 2020, 1:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Jul 9 2020, 4:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3221–3222

Brace formatting

3226–3228

There's no obstacle to handling v3 here, it should work in the other cases

rdomingu marked 2 inline comments as done.Jul 13 2020, 10:51 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3226–3228

Note that v3f16 is not handled in SIISelLowering.cpp either (line 7255). Are you suggesting we implement v3f16 there too?

arsenm added inline comments.Jul 13 2020, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3226–3228

There is a v3f16 definition now, so you could. I think it's more important for globalisel to be more complete

rdomingu updated this revision to Diff 281734.Jul 29 2020, 2:02 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 281970.Jul 30 2020, 10:21 AM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 282042.Jul 30 2020, 2:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 4 2020, 7:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3228

No else after return

3243

No else after return

3245

llvm_unreachable instead of assert(alse)

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
197

I know clang-format really wants to pack these onto a single line, but it's a terrible idea and you shouldn't listen to it

rdomingu updated this revision to Diff 284443.Aug 10 2020, 10:48 AM
rdomingu marked 4 inline comments as done.

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 28 2020, 11:27 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3217

Same as above

3231

If you just resize to 8 below, why not use an array?

3232–3236

It would be preferable to emit a concat_vectors of <2 x s16> pieces here

3235

Resize here is weird. You can push back, or constructed PackedRegs to the desired size?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7281

++I

rdomingu added inline comments.Sep 10 2020, 6:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3232–3236

Sorry, I'm new to this. Why would concat_vectors be preferable than build_vector? Could you please elaborate?

arsenm added inline comments.Sep 16 2020, 7:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3232–3236

Because a G_BUILD_VECTOR with 16-bit sources isn't naturally legal. This works, it just adds more work for the legalizer to reprocess these when you could produce something that's legal to begin with to save compile time

rdomingu added inline comments.Sep 16 2020, 8:07 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3232–3236

I see. But how would you go from v3f16 to concat_vectors of <2 x 16> to v4f32 (which is what we want at the end)?

arsenm added inline comments.Sep 16 2020, 8:16 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3232–3236

I think I'm missing something. Why is this going from <3 x s16> to <4 x s32>? Isn't this the unpacked layout case? Why isn't this just an G_ANYEXT from <3 x s16> to <3 x s32>?

3235

You can do this in the initial construction, also the small size should just be 8? Or use std::array?

rdomingu added inline comments.Sep 29 2020, 7:09 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3232–3236

This is the image workaround for the packed layout case. We don't want to change the data layout which is why we shouldn't use G_ANYEXT. We just want to make the compiler think the data is twice as big.

arsenm accepted this revision.Sep 30 2020, 7:43 AM
This revision is now accepted and ready to land.Sep 30 2020, 7:43 AM
This revision was landed with ongoing or failed builds.Oct 7 2020, 5:04 AM
This revision was automatically updated to reflect the committed changes.