This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI
ClosedPublic

Authored by nhaehnle on Oct 11 2018, 12:26 PM.

Details

Summary

To workaround a hardware issue in the (base + offset) calculation
when base is negative. The impact on code quality should be limited
since SILoadStoreOptimizer still runs afterwards and is able to
combine loads/stores based on known sign information.

This fixes visible corruption in Hitman on SI (easily reproducible
by running benchmark mode).

Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923

Diff Detail

Event Timeline

nhaehnle created this revision.Oct 11 2018, 12:26 PM
arsenm added inline comments.Oct 12 2018, 4:43 AM
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
296 ↗(On Diff #169270)

This seems like an expensive check for this. Is this so important?

arsenm added inline comments.Oct 12 2018, 4:47 AM
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
296 ↗(On Diff #169270)

I mean I don't understand why this would really matter that much. If we ignore this problem and let it vectorize, the resulting code shouldn't be that different when selection fixes it. The advantage is just making the IR closer to the final hardware instructions, which has minor cost analysis benefits?

This still doesn't solve the real issue? We still have the selection. The vectorized load could appear in the original source program

This still doesn't solve the real issue? We still have the selection. The vectorized load could appear in the original source program

Well, an argument could be made that the corruption issue really only occurred because the shader was doing something that would be undefined behavior in the usual C semantics, namely accessing out-of-bounds data. What kind of semantics we want to support for such accesses is kind of up to us. Do we want to support vectorized loads that are partially out-of-bounds? GLSL says that out-of-bounds loads should be handled gracefully, but Mesa will always create scalar loads, so this particular patch is fine.

That said, I suppose we could say that we want partially out-of-bounds vectorized loads to be handled gracefully as well (with per-element out-of-bounds checks), and scalarize those vectorized loads again during selection. I haven't looked into that yet.

nhaehnle updated this revision to Diff 169857.Oct 16 2018, 10:53 AM

@arsenm How about this approach?

nhaehnle updated this revision to Diff 169860.Oct 16 2018, 10:56 AM
nhaehnle retitled this revision from AMDGPU: Restrict DS load/store vectorizing on SI to AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI.
nhaehnle edited the summary of this revision. (Show Details)
nhaehnle removed a subscriber: tomswift98.

Update the summary

Yes, something like this. I still expected to see some change in SelectDS64Bit4ByteAligned that has the fixme for this bug? I suppose the check for the sign bit would need to be here in the lowering since it would be more difficult to split during selection

lib/Target/AMDGPU/SIISelLowering.cpp
6715

NumElements == 2 is redundant and possibly wrong?

nhaehnle updated this revision to Diff 169870.Oct 16 2018, 11:33 AM

Right, I looked at SelectDS64Bit4ByteAligned, but doing the split there
requires more code and also seems wrong since it'd mean inserting
additional nodes fairly late without giving them a chance to be combined
(not that they'd be combined very often, but still).

In a sense, splitting the vectorized load is a form of legalization.

Anyway, removing the FIXME comment.

nhaehnle added inline comments.Oct 16 2018, 11:34 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6715

I don't know. We shouldn't have unaligned i64 loads at this point, I guess, but the check does ensure that we're really dealing with a vector load. And NumElements > 2 is dealt with above.

nhaehnle added inline comments.Oct 16 2018, 11:41 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6715

And what's more, even if we did have unaligned i64 load/store at this point, it doesn't really make sense to try to fix them. The SI bug only affects the case where the 8 bytes straddle the lower bound of LDS (i.e., vaddr == -4). Trying to load an i64 from there is wrong anyway.

arsenm accepted this revision.Oct 16 2018, 2:24 PM

LGTM. Is this only actually a problem with the UB because we don't bother trying to set m0 to the allocated size?

This revision is now accepted and ready to land.Oct 16 2018, 2:24 PM
nhaehnle updated this revision to Diff 169959.Oct 17 2018, 1:00 AM
  • actually disable the pattern to select ds_{read,write}2_b32 to catch cases where we'd regress this fix

LGTM. Is this only actually a problem with the UB because we don't bother trying to set m0 to the allocated size?

Thanks. Not quite, the issue is that the shader actually does an out-of-bounds load. That would be UB in a normal language, except that GLSL actually says it's okay in this case. Of course the shader doesn't use the resulting value (there's a loop, and the shader loads two adjacent i32 values in each loop iteration, from indices n-1 and n; the n-1 part ends up not contributing to the calculation during the first iteration).

This revision was automatically updated to reflect the committed changes.