This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add a function attribute that shrinks buggy s_buffer opcodes on GFX9
Abandoned, Public

Authored by mareko on Jan 15 2018, 9:41 AM.

Details

Reviewers
arsenm
nhaehnle
Summary

The function attribute shouldn't be set for shaders running in an untrusted environment.

42952 affected shaders.
Code size in affected shaders: -12.64%

Diff Detail

Event Timeline

mareko created this revision. Jan 15 2018, 9:41 AM
arsenm added inline comments. Jan 15 2018, 10:04 AM
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
165–166

I don't understand why the function attribute is needed for a hardware bug. If this happens to help other targets that aren't working around the bug, we should add an optimization to do this somewhere.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

We shouldn't be putting bug workarounds in an optimization pass. This should probably be part of the initial selection.

mareko added inline comments. Jan 15 2018, 10:23 AM
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
165–166

Can you be more specific? I don't understand. Note that Mesa may set the function attribute differently for each shader, because it depends on whether the shader is trusted or untrusted.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

What initial selection are you talking about? Note that the placement of this pass is perfect, because the new code needs to run after SILoadStoreOptimizer.

arsenm added inline comments. Jan 15 2018, 10:29 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

One option would be to do this in the DAG and split the intrinsics before getting selected in the first place, or we could do this in AdjustInstrPostInstrSelection or in a new bug workaround pass.

Why does it need to be after SILoadStoreOptimizer? Is it just because it will try to merge and form these? If it's buggy it should just not do that in the first place.
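
For reference, the post-selection hook mentioned above has this shape in the AMDGPU backend. The signature is the real TargetLowering interface, but the body below is only an illustrative sketch with hypothetical helpers, not anything from this patch:

    // Real override point (SITargetLowering already implements this hook
    // for other post-selection fix-ups). Both helpers are hypothetical.
    void SITargetLowering::AdjustInstrPostInstrSelection(MachineInstr &MI,
                                                         SDNode *Node) const {
      if (needsGfx9SBufferWorkaround(MI)) // hypothetical predicate
        applyGfx9SBufferWorkaround(MI);   // hypothetical rewrite
    }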

mareko added inline comments. Jan 15 2018, 11:23 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

The intrinsic is translated into s_buffer_load_dword only. There are no intrinsics for the xN opcodes - this is actually the optimal situation because we have SILoadStoreOptimizer. SILoadStoreOptimizer merges s_buffer_load_dword into x2 and x4 - this is the only place that generates the xN opcodes. This new code needs to run after that to convert s_buffer_load into s_load for xN opcodes where N >= 2.
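
A minimal sketch of the flow described here, assuming a hypothetical attribute name ("amdgpu-buggy-s-buffer") and a hypothetical rewrite helper; the opcode names and iteration idioms are real LLVM, the rest is illustrative and not the actual patch:

    static void rewriteAsSLoad(MachineInstr &MI); // hypothetical helper

    // Only SILoadStoreOptimizer forms the xN opcodes, so matching them is
    // sufficient once this runs after that pass.
    static bool isMergedSBufferLoad(const MachineInstr &MI) {
      switch (MI.getOpcode()) {
      case AMDGPU::S_BUFFER_LOAD_DWORDX2_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX4_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX8_IMM:
      case AMDGPU::S_BUFFER_LOAD_DWORDX16_IMM:
        return true;
      default:
        return false;
      }
    }

    static void workAroundGfx9SBufferBug(MachineFunction &MF) {
      // Untrusted shaders never set the attribute (see the summary), so
      // they keep the s_buffer form.
      if (!MF.getFunction().hasFnAttribute("amdgpu-buggy-s-buffer"))
        return;
      for (MachineBasicBlock &MBB : MF)
        for (MachineInstr &MI : llvm::make_early_inc_range(MBB))
          if (isMergedSBufferLoad(MI))
            rewriteAsSLoad(MI); // replace with the equivalent s_load_dwordxN
    }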

dstuttard added inline comments. Jan 17 2018, 1:05 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

I'm planning to add intrinsics that will allow multi-dword s_buffer_loads, as this is considerably easier for our front-end. However, I guess the placement of this work-around means that this won't matter.

I agree that placing the work-around after SILoadStoreOptimizer looks like the best option, but whether it should be a separate bug workaround pass as Matt suggests is arguable - Matt, do you think it would be cleaner to have it pulled into a separate pass?

nhaehnle added inline comments. Jan 24 2018, 9:18 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?

By the way, we may want to assume that buffers are aligned to e.g. 16 bytes and make this dependent on the alignment, e.g. if the low 4 bits of the saddr are known to be zero, we should still be able to use x4.
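
For constant offsets, the suggested check reduces to a low-bit test. A tiny sketch, with illustrative names:

    // A byte offset is a multiple of 4*N iff its low log2(4*N) bits are
    // zero; for an x4 load (N = 4) that means the low 4 bits.
    static bool offsetAllowsXN(uint64_t ByteOffset, unsigned N) {
      return (ByteOffset & (4 * N - 1)) == 0;
    }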

mareko added inline comments. Jan 25 2018, 12:11 PM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

> Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?

That's the current behavior, and it increases code size by 14.4%.

Loads are only guaranteed to be aligned to 4 bytes.

nhaehnle accepted this revision. Jan 26 2018, 8:15 AM

Okay, I see it now, and I can live with it. Please still consider the alternative alignment approach - it is a bit more restrictive in what it can do, but it could be enabled unconditionally.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

> Loads are only guaranteed to be aligned to 4 bytes.

We could easily increase the required alignment on UBOs in radeonsi, let's say to 16 or even 64 bytes. I believe some other drivers set it to 256 bytes, which is the maximum allowed by the OpenGL spec. Then checking alignment for constant offsets becomes trivial, and for non-constant offsets we could still use computeKnownBits.
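
A sketch of the non-constant case using computeKnownBits, written against the current SelectionDAG API (this is not part of the patch under review):

    #include "llvm/CodeGen/SelectionDAG.h"
    #include "llvm/Support/KnownBits.h"
    using namespace llvm;

    // Offset is the byte-offset operand feeding the s_buffer_load.
    static bool offsetKnown16Aligned(SelectionDAG &DAG, SDValue Offset) {
      KnownBits Known = DAG.computeKnownBits(Offset);
      // Low 4 bits known zero => the offset is provably a multiple of 16,
      // so forming an x4 load would be safe under a 16-byte UBO alignment.
      return Known.countMinTrailingZeros() >= 4;
    }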

This revision is now accepted and ready to land. Jan 26 2018, 8:15 AM
mareko added inline comments. Jan 28 2018, 4:23 PM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

The problem is that even if we set the required UBO alignment to 16, there is no guarantee that s_buffer_load_dwordx4 will be aligned to 16: s_buffer_load_dwordx4 is constructed from multiple s_buffer_load_dword instructions that load consecutive dwords, and nothing guarantees that the first one is aligned to 16. Of course, we could enforce that the literal offset be a multiple of 16, but I think that would decrease the number of opportunities for load opcode merging.
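
A concrete illustration (hypothetical offsets) of how a merge can end up under-aligned even with a 16-byte-aligned UBO:

    s_buffer_load_dword   off:4
    s_buffer_load_dword   off:8
    s_buffer_load_dword   off:12
    s_buffer_load_dword   off:16
      -> merged: s_buffer_load_dwordx4 off:4   ; 4 is not a multiple of 16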

nhaehnle added inline comments. Jan 30 2018, 2:53 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

Yes, that's precisely what I meant. It provides fewer opportunities, yes, but it could be enabled unconditionally without a special flag, so it's still a useful trade-off.

mareko added inline comments. Jan 30 2018, 4:10 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

I'm a little concerned about app compatibility if we increase the alignment. 256 bytes was for ancient hardware.

arsenm requested changes to this revision. Feb 1 2018, 9:18 AM
arsenm added inline comments.
lib/Target/AMDGPU/SIShrinkInstructions.cpp
310–312

This cannot be in this optimization pass. This must be done somewhere else.

This revision now requires changes to proceed. Feb 1 2018, 9:18 AM
mareko abandoned this revision. Feb 1 2018, 9:38 AM

The workaround is not needed.