This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Do not combine loads/store across physreg defs
ClosedPublic

Authored by nhaehnle on Nov 22 2017, 4:20 AM.

Details

Summary

Since this pass operates on machine SSA form, this should only really
affect M0 in practice.

Fixes various piglit variable-indexing/vs-varying-array-mat4-index-*

Change-Id: Ib2a1dc3a8d7b08225a8da49a86f533faa0986aa8
Fixes: r317751 ("AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4")

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle created this revision.Nov 22 2017, 4:20 AM
rampitec edited edge metadata.Nov 22 2017, 11:14 AM

As far as I understand that is only a concern if defined register is M0 since it is read by the ds_* instructions. I.e. it should be better to check to M0, not just any physreg.
Also this should not be a concern on GFX9 since we have lds instructions which do not read M0 there (check Subtarget->ldsRequiresM0Init()).

arsenm added inline comments.Nov 27 2017, 10:14 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
359 ↗(On Diff #123905)

There could also be a call instruction, and the function may modify m0 or anything else

test/CodeGen/AMDGPU/llvm.amdgcn.interp.ll
185 ↗(On Diff #123905)

Can you add a test with a no inline call to make sure that works correctly?

As far as I understand that is only a concern if defined register is M0 since it is read by the ds_* instructions. I.e. it should be better to check to M0, not just any physreg.
Also this should not be a concern on GFX9 since we have lds instructions which do not read M0 there (check Subtarget->ldsRequiresM0Init()).

The LDS use here is a bit of a red herring; it's really not about that. The original case where I found the bug had no "proper" LDS instructions at all, only v_interp, and the merged memory instructions were buffer instructions. You could probably construct a similar case also with e.g. only relative indexing with different indices, or with relative indexing and s_sendmsg (which uses M0).

The real problem is that the pass assumes machine-SSA form, and this assumption is broken with physreg-defs. Here's what happens:

%vreg0 = mem-instruction-1
M0 = def-1
use(%vreg0, M0)
M0 = def-2
...
mem-instruction-2

Without this fix, this gets changed to:

M0 = def-1
M0 = def-2
...
%vreg0, ... = merged-mem-instruction
use(%vreg0, M0)

So the %vreg0-use gets moved (because it depends on mem-instruction-1), without regard for the fact that the instruction reads a register that is later overwritten by a different value.

This possible write-after-read hazard of registers is something that the pass simply hasn't tracked before, because for virtual registers in machine-SSA its unnecessary. It's only with physical registers that it becomes necessary; we don't have many of those at this stage in the flow in practice, but M0 is not a special case.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
359 ↗(On Diff #123905)

To clarify, this means that we should bail out entirely when we encounter a call instruction, right?

As far as I understand that is only a concern if defined register is M0 since it is read by the ds_* instructions. I.e. it should be better to check to M0, not just any physreg.
Also this should not be a concern on GFX9 since we have lds instructions which do not read M0 there (check Subtarget->ldsRequiresM0Init()).

The LDS use here is a bit of a red herring; it's really not about that. The original case where I found the bug had no "proper" LDS instructions at all, only v_interp, and the merged memory instructions were buffer instructions. You could probably construct a similar case also with e.g. only relative indexing with different indices, or with relative indexing and s_sendmsg (which uses M0).

The real problem is that the pass assumes machine-SSA form, and this assumption is broken with physreg-defs. Here's what happens:

%vreg0 = mem-instruction-1
M0 = def-1
use(%vreg0, M0)
M0 = def-2
...
mem-instruction-2

Without this fix, this gets changed to:

M0 = def-1
M0 = def-2
...
%vreg0, ... = merged-mem-instruction
use(%vreg0, M0)

So the %vreg0-use gets moved (because it depends on mem-instruction-1), without regard for the fact that the instruction reads a register that is later overwritten by a different value.

This possible write-after-read hazard of registers is something that the pass simply hasn't tracked before, because for virtual registers in machine-SSA its unnecessary. It's only with physical registers that it becomes necessary; we don't have many of those at this stage in the flow in practice, but M0 is not a special case.

I see. But my point is we can easily get the situation:

m0 = def-1
ds_read
m0 = def-2
ds_read

That worth nothing that defs are euqal in our case. Unless we are going to prove defs are equal we shall be unable to combine those two ds_reads. Now with your patch we will be unable to combine them.
This is fine unless you think about GFX9 versions of ds_read which do not imp-use M0. We still shall be able to combine them, but we will not after this patch.

So what I mean not just bail out on physreg def, but check if the register is actually used by any instructions you are going to move before the required def.
Does it make sense to you?

arsenm edited edge metadata.Nov 28 2017, 12:02 PM

This pass should probably ignore Subtarget->ldsRequiresM0Init(). The instructions are selected to a set without m0, so this should be able to just check the register uses

This pass should probably ignore Subtarget->ldsRequiresM0Init(). The instructions are selected to a set without m0, so this should be able to just check the register uses

Right, I do not propose to check for a feature. I propose to check for actual register uses.

arsenm added inline comments.Nov 28 2017, 12:14 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
359 ↗(On Diff #123905)

I think this would just fall out naturally from physreg defs (although I guess you could have a void call that doesn't modify any visible registers), but you might need a call check

We need this fix in LLVM 6.0, right?

nhaehnle updated this revision to Diff 131807.Jan 29 2018, 8:18 AM

Add a non-inline call test.

nhaehnle marked an inline comment as done.Jan 29 2018, 8:19 AM

I would like to commit this as-is, since it's a bug fix. I have a change which tracks def-use as well, I'll upload it in a moment as a related revision, and would like to commit it separately.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
359 ↗(On Diff #123905)

I've added a test case to ds_read2.ll. It passes without modifications to the code.

mareko accepted this revision.Jan 29 2018, 2:26 PM
This revision is now accepted and ready to land.Jan 29 2018, 2:26 PM
This revision was automatically updated to reflect the committed changes.