SMEM opcodes are faster, so we want to use them if possible.
Deus Ex performance: +13%
(with on-demand shader compilation enabled in Deus Ex, so the real
improvement should be higher)
I'm not sure I understand what the point of the AnyReg parameter is
Can you move this getGeneration check into a new Subtarget->hasGLCForSMEM()?
Do you also ne ed to handle x8/x16? (those can probably be a separate patch, but for now an assert/unreachable woul dwork)
These should use getNamedOperand
These look backwards from the order (it's also called offset, not inst_offset).
Tests seem to be missing for the moveToVALU path (or should those all be covered by the existing intrinsic tests?)
I see the same problem as Tom here. Do those shaders use read-only SSBOs? If so, this could perhaps be done at the Mesa level. But even then, there'd be a problem if the same memory is bound to two different SSBOs, and one of them is written to, unless the SSBO is marked 'restrict'.
Can you be more specific about why it's incorrect? I only see an issue with L1 coherency (GLC=0).
I'll change that to "AllowNonConst". It should be obvious if you look below.
No. They can't be selected for amdgcn.buffer.load.
The order is the same as in the .inc files.
They are all covered by existing tests.
Another possible issue is that SMEM instructions ignore bits of the resource descriptor. So you would need some way to tell the compiler that it wouldn't be ignoring some relevant resource bits by selecting to SMEM.
Doesn't this make this change unworkable? Presumably the front-end would need to annotate in some way to indicate that this is a legitimate transformation, in which case you might as well use a different intrinsic anyway. Are there any circumstances where you can determine if this is definitely the case?
I've got a situation that benefits from this change, but equally could use the solution in D27586. Perhaps that change could be enhanced with the non-const offset change in this review?
Yes, having separate intrinsics like D27586 is preferable not just because of the sL1 vs vL1 coherency stuff, but also because SMEM instructions have many differences compared to VMEM and sometimes even the same looking VMEM and SMEM instructions have different behavior. It's also important that s.load intrinsics support non-constant offsets and are lowered to VMEM when the address comes from a VGPR.