This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement hardware bug workaround for image instructions
ClosedPublic

Authored by rdomingu on Jun 4 2020, 8:57 AM.

Details

Summary

[AMDGPU] Implement hardware bug workaround for image instructions

This implements a workaround for a hardware bug in gfx8 and gfx9,
where register usage is not estimated correctly for image_store and
image_gather4 instructions when D16 is used.

Change-Id: I4e30744da6796acac53a9b5ad37ac1c2035c8899

Diff Detail

Event Timeline

rdomingu created this revision.Jun 4 2020, 8:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2020, 8:57 AM
rdomingu edited the summary of this revision. (Show Details)Jun 4 2020, 10:42 AM
rdomingu added reviewers: arsenm, rampitec.
arsenm added a comment.Jun 4 2020, 1:49 PM

Also needs to update the GlobalISel path

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6034–6035

IsD16 && (!hasUnpackedD16() || (Gather4 && hasGatherbug))?

7455

You can initialize this to the target size and then avoid push_back

7456

This is the same as just MVT::v2i16

7458

Formatting

rdomingu marked 3 inline comments as done.Jun 5 2020, 6:15 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
6034–6035

We want NumVDataDwords = DMaskLanes for Gather4 && IsD16 && hasGatherBug. That wouldn't be the case with your suggestion.

rdomingu updated this revision to Diff 276533.Jul 8 2020, 1:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Jul 9 2020, 4:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3550–3551

Brace formatting

3555–3557

There's no obstacle to handling v3 here, it should work in the other cases

rdomingu marked 2 inline comments as done.Jul 13 2020, 10:51 AM
rdomingu added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3555–3557

Note that v3f16 is not handled in SIISelLowering.cpp either (line 7255). Are you suggesting we implement v3f16 there too?

arsenm added inline comments.Jul 13 2020, 10:56 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3555–3557

There is a v3f16 definition now, so you could. I think it's more important for globalisel to be more complete

rdomingu updated this revision to Diff 281734.Jul 29 2020, 2:02 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 281970.Jul 30 2020, 10:21 AM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

rdomingu updated this revision to Diff 282042.Jul 30 2020, 2:09 PM

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 4 2020, 7:30 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3589

No else after return

3604

No else after return

3606

llvm_unreachable instead of assert(alse)

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
305

I know clang-format really wants to pack these onto a single line, but it's a terrible idea and you shouldn't listen to it

rdomingu updated this revision to Diff 284443.Aug 10 2020, 10:48 AM
rdomingu marked 4 inline comments as done.

Updating D81172: [AMDGPU] Implement hardware bug workaround for image instructions

arsenm added inline comments.Aug 28 2020, 11:27 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3560

If you just resize to 8 below, why not use an array?

3561–3565

It would be preferable to emit a concat_vectors of <2 x s16> pieces here

3564

Resize here is weird. You can push back, or constructed PackedRegs to the desired size?

3570

Same as above

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7456

++I

rdomingu added inline comments.Sep 10 2020, 6:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3561–3565

Sorry, I'm new to this. Why would concat_vectors be preferable than build_vector? Could you please elaborate?

arsenm added inline comments.Sep 16 2020, 7:58 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3561–3565

Because a G_BUILD_VECTOR with 16-bit sources isn't naturally legal. This works, it just adds more work for the legalizer to reprocess these when you could produce something that's legal to begin with to save compile time

rdomingu added inline comments.Sep 16 2020, 8:07 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3561–3565

I see. But how would you go from v3f16 to concat_vectors of <2 x 16> to v4f32 (which is what we want at the end)?

arsenm added inline comments.Sep 16 2020, 8:16 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3561–3565

I think I'm missing something. Why is this going from <3 x s16> to <4 x s32>? Isn't this the unpacked layout case? Why isn't this just an G_ANYEXT from <3 x s16> to <3 x s32>?

3564

You can do this in the initial construction, also the small size should just be 8? Or use std::array?

rdomingu added inline comments.Sep 29 2020, 7:09 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
3561–3565

This is the image workaround for the packed layout case. We don't want to change the data layout which is why we shouldn't use G_ANYEXT. We just want to make the compiler think the data is twice as big.

arsenm accepted this revision.Sep 30 2020, 7:43 AM
This revision is now accepted and ready to land.Sep 30 2020, 7:43 AM
This revision was landed with ongoing or failed builds.Oct 7 2020, 5:04 AM
This revision was automatically updated to reflect the committed changes.