This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improve v_cmpx usage on GFX10.3.
ClosedPublic

Authored by tsymalla on Feb 14 2022, 1:37 AM.

Details

Summary

On GFX10.3 targets, the following instruction sequence

v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR

leads to a fairly long stall caused by a VALU write to a SGPR and having the
following SALU wait for the SGPR.

An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.
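
That is, the intended replacement looks roughly like this (a sketch only; the exact opcodes depend on the wave size and the comparison, and are not copied from the patch's output):

s_mov_* SGPR, exec
v_cmpx_* ...

where the s_mov_* saves the exec mask and the v_cmpx_* writes the comparison result directly to exec.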

This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.

It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.

Diff Detail

Event Timeline

tsymalla created this revision.Feb 14 2022, 1:37 AM
tsymalla requested review of this revision.Feb 14 2022, 1:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 14 2022, 1:37 AM
sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/wqm.ll
384–386

I think the v_cmpx instruction should be inserted at the place of the s_and_saveexec, not at the place of the v_cmp. Otherwise this v_mov gets executed with the wrong exec mask.

tsymalla added inline comments.Feb 14 2022, 5:50 AM
llvm/test/CodeGen/AMDGPU/wqm.ll
384–386

Thanks, going to fix this.

Thank you for doing this. Please make sure to address the clang-format complaints. I have a bunch of other comments inline.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
301–303

Just look at the operands explicitly by argument instead of iterating over them.

305

Can it be anything else? This should probably just be an assertion.

325

VCmpDest should be equal to Reg by definition. Again, this can be an assertion.

326

This doesn't work how you want it to work because after SSA elimination, if you need to iterate over defs then almost certainly you have multiple uses as well.

Do you have liveness information available and can you use it to get to the relevant defining instruction in a more targeted way?

Alternatively: this transform is really only helpful if the V_CMP and S_AND_SAVEEXEC are close together. If the V_CMP appears much earlier in the code (perhaps even in a different basic block!) then we shouldn't do the transform. So a better approach may be to do a bounded backwards walk through instructions. (This interacts with the point about checking for intermediate EXEC writes in a later comment.)

359–363

There shouldn't be any EXEC modifications between those two instructions. If there are, it's a sign that something really weird is going on and certainly this trick isn't sufficient to ensure correctness. You should probably move this check into the find function and have that function bail out if it finds an EXEC modification.

574–575

Spurious whitespace change.

639

gfx10.3 and later.

651–662

The S_AND_SAVEEXEC_Bnn instruction has to be one of the last instructions of the basic block. It actually gets derived from a terminator instruction, though for some reason we apparently never changed the S_xxx_SAVEEXEC instructions to be terminators as well.

In any case, what you should do here from a compile-time perspective is to iterate the basic block backwards and limit the loop to a small number of iterations. Maybe 5?

Also, this loop should break if it finds a different instruction that writes EXEC (shouldn't happen in practice, but...).
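
A rough sketch of what that bounded backward walk could look like (a fragment for illustration only; findDefiningVCmp and isVCmp are made-up names, not the patch's actual helpers):

// Walk backwards from the s_and_saveexec, looking for the v_cmp. Give up after
// a small number of instructions or if anything else writes EXEC in between.
static MachineInstr *findDefiningVCmp(MachineInstr &SaveExec, MCRegister Exec,
                                      const SIRegisterInfo *TRI) {
  constexpr unsigned SearchLimit = 5; // assumed bound, as suggested above
  unsigned Count = 0;
  for (MachineInstr *I = SaveExec.getPrevNode(); I && Count < SearchLimit;
       I = I->getPrevNode(), ++Count) {
    if (isVCmp(*I))                     // hypothetical v_cmp opcode predicate
      return I;
    if (I->modifiesRegister(Exec, TRI)) // intermediate EXEC write: bail out
      return nullptr;
  }
  return nullptr;
}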

llvm/lib/Target/AMDGPU/VOPCInstructions.td
222

Shouldn't the is_vcmp field here logically be 1?

259

Why the complicated expression for is_vcmp? Shouldn't that just be 0?

More to the point, VOPC_Pseudo already derives from VCMPVCMPXTable, so this is rather dodgy. I'd suggest changing the table to have an is_cmpx field and using a let here to override the field's default value from the VOPC_Pseudo.

tsymalla marked 6 inline comments as done.Feb 14 2022, 10:19 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
359–363

This is possible. In wqm.ll, check the following sequence:

; GFX10-W32-NEXT:    v_cmp_nlt_f32_e32 vcc_lo, 0, v0
; GFX10-W32-NEXT:    s_and_b32 exec_lo, exec_lo, s12
; GFX10-W32-NEXT:    buffer_store_dword v4, v2, s[0:3], 0 idxen
; GFX10-W32-NEXT:    ; implicit-def: $vgpr0
; GFX10-W32-NEXT:    s_wqm_b32 exec_lo, exec_lo
; GFX10-W32-NEXT:    s_and_saveexec_b32 s13, vcc_lo

Here, the second instruction modifies the exec register.
However, I investigated that case already and found out that there's no way to guarantee correctness here, so I decided to do as you proposed and moved the check into the find function. :-)

llvm/test/CodeGen/AMDGPU/wqm.ll
384–386

Apparently, there is another case to consider. Moving the v_cmpx to the place of the s_and_saveexec will use the wrong value for v0, so a copy needs to be inserted whenever a v_cmp register source operand gets redefined in between in the original sequence, and the cmpx needs to use the new operand.

tsymalla marked 3 inline comments as done.Feb 15 2022, 6:49 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
326

Working with backwards checking now, which should work. Thanks.

llvm/lib/Target/AMDGPU/VOPCInstructions.td
222

I renamed the variables. What I want to achieve here is to map v_cmpx instructions to their v_cmp counterparts in the mapping.

259

Using just 0 would not work as there are other VOPC instructions which would then be inserted into the result table, which is not wanted.

The idea is the following:

Evaluate if subst(op, v_cmpx, v_cmp) != v_cmpx, which should yield true if the substitution succeeds, that is, if the instruction starts with v_cmpx. This would be much easier if TableGen supported string operations like "beginsWith". However, the way you proposed should also work, so I'm going to change this!

tsymalla marked 2 inline comments as done.Feb 16 2022, 4:28 AM
tsymalla added inline comments.Feb 16 2022, 5:01 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
651–662

Looking at the other changes, this transformation should only be applied when it neither breaks correctness nor introduces overhead (spilling). In the next revision, the find function looks at the instructions of a basic block backwards from a given s_and_saveexec instruction and breaks whenever one of these cases appears. I am going to change that again so that it only iterates a maximum number of instructions backwards when trying to find the compare instruction, which should reduce the work to do.

arsenm added inline comments.Feb 16 2022, 6:49 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
639

Why only do this on gfx10.3? Every target has v_cmpx?

arsenm added inline comments.Feb 16 2022, 6:51 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
320

Look at the def operands directly instead? Also since this is post-SSA you need to defend against subregisters

tsymalla marked an inline comment as done.Feb 16 2022, 6:58 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
639

We decided to do so because it is unclear whether this gives any performance advantage on targets before GFX10, and on GFX10.1 / GFX10.2 an additional s_waitcnt_depctr needs to be inserted for correctness, so there probably won't be any performance gain from this on those targets.

@foad might elaborate more details for this.

foad added inline comments.Feb 16 2022, 7:23 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
639

I don't have much to add. The performance of v_cmp followed by s_and_saveexec is particularly bad on GFX10 because of instruction counter stalls. So yes we probably could do this pre GFX10, but I don't know if there's any reason to.

For the bugs in GFX10.1 and GFX10.2 see GCNHazardRecognizer::fixVcmpxExecWARHazard. Having to insert S_WAITCNT_DEPCTR probably negates any advantage you get from using v_cmpx in the first place.

tsymalla updated this revision to Diff 409310.Feb 16 2022, 10:01 AM
tsymalla marked 5 inline comments as done.

Addresses the comments from the review.
Adds additional checks to prevent the transformation from
being applied, if the result would be less optimal than
the original code or simply incorrect.

It would probably make sense to only insert the pattern if Exec is not only not written but also not implicitly used by a VALU instruction between the cmp and the saveexec - otherwise, the VALU will be executed with the wrong exec mask.

sebastian-ne added inline comments.Feb 17 2022, 5:15 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
303

Instead of taking a SmallVector, you can take an ArrayRef<MCRegister>, then you can also pass {SaveExecSrc0->getReg()} without needing a vector.

Also, if you needed a vector here because it’s modified inside the function, it is considered better style to pass a SmallVectorImpl<MCRegister>, so that the caller can decide the number for inline allocation.
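
For illustration, the difference might look like this (a sketch; findInstrBackwards is the function discussed here, but the exact signature is assumed):

// Read-only list of registers: take an ArrayRef, so callers can pass a braced list.
MachineInstr *findInstrBackwards(MachineInstr &Origin,
                                 ArrayRef<MCRegister> NonModifiableRegs);
// call site: findInstrBackwards(*SaveExec, {SaveExecSrc0->getReg()});

// Only if the callee has to grow or modify the list: take SmallVectorImpl, so
// the caller still decides the inline allocation size.
MachineInstr *findInstrBackwards(MachineInstr &Origin,
                                 SmallVectorImpl<MCRegister> &NonModifiableRegs);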

310–322

I think this could be clearer:

if (Pred(&*A))
  return &*A;

if (DisallowDefBetween) {
  // …
}
++CurrentIteration;
llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll
1

This file doesn’t look autogenerated, but maybe it makes sense to generate it.

The pre-merge builds report some test failure in MC/AMDGPU and MC/Disassembler/AMDGPU. I think the assembler fix in VOPCInstructions.td and these test changes could be a separate patch.

llvm/test/CodeGen/AMDGPU/wqm.ll
2962–2965

This and the test above with the buffer_store_dword still look incorrect.

tsymalla updated this revision to Diff 409670.Feb 17 2022, 8:29 AM
tsymalla marked 3 inline comments as done.

Addressing review changes, reducing the number of cases where the
transformation takes place.

llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll
1

The changes here are minor, and I am not really sure what this test actually tests, so I am just going to remove the auto-generate line for now.

sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
326–330

I’m not sure what I thought before, but a buffer_store obviously only changes memory and not its register arguments, so it’s completely fine if a buffer_store is between the v_cmp and the s_and_saveexec. Checking modifiesRegister should be enough.

356

There’s quite a lot of these isReg and !getSubReg checks in this patch. What are they guarding? I replaced them by asserts and there was no amdgpu codegen test that failed. (Leaving aside the isReg test for v_cmp arguments, which can also be constants.)

653

Not sure if this was already mentioned, but do you make sure that the v_cmp is not used anywhere outside the s_saveexec?

I guess it would still be beneficial to insert a v_cmpx if the v_cmp has more uses, but it shouldn’t be removed in that case.

tsymalla added inline comments.Feb 18 2022, 7:12 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
326–330

You are correct, any memory load or store will just update the contents behind the address stored in the VGPR. Thanks for that. I did not think thoroughly about this change.

356

Regarding the v_cmp arguments, these can be constants, as you mentioned, so these checks make sense.
However, I think we can safely assume that the check for s_and_saveexec will always pass as the arguments are registers, so I'm going to remove these.

653

No, I did not take this into account (it should mostly just write to VCC as this is checking sequences of s_and_saveexec and v_cmp).
I am going to add some more control flow for cases like this.

I don't really get your last comment though. Why would it be beneficial to add the v_cmpx in addition to the v_cmp in cases like this?

Joe_Nash added inline comments.Feb 18 2022, 1:11 PM
llvm/lib/Target/AMDGPU/VOPCInstructions.td
268

I think this can be
"let IsVCMPX = 1;"
Maybe it happened when you inverted the table to have IsVCMPX instead of isVCMP.

tsymalla marked 3 inline comments as done.Feb 19 2022, 2:10 AM
tsymalla added inline comments.Feb 19 2022, 2:10 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
653

Looking at this again.

The result of v_cmp will be VCC = Exec & ThreadsPassed, the result of v_cmpx will be Exec[Executing Thread] = bool for each executing thread. It could be that the result inside the VCC is used as an integer carry, so in these cases the transformation would be wrong. Another case which was already discussed was the early-exit option by using VCCZ, but I think it makes sense to ignore this for now.

It would require additional logic to make sure any following instructions using VCC (if there are any) are using the correct input value, which probably reduces the gain here - so I'm just not going to do the transformation if any following instruction uses the destination operand of the v_cmp.
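
A minimal sketch of that bail-out (a fragment; VCmpDest, SaveExec and TRI are assumed to exist in the surrounding code, and the landed version may rely on liveness information rather than a plain scan):

// Conservatively keep the original sequence if anything after the saveexec
// still reads the v_cmp destination (e.g. vcc) within the block.
for (MachineInstr *MI = SaveExec.getNextNode(); MI; MI = MI->getNextNode())
  if (MI->readsRegister(VCmpDest, TRI))
    return false; // destination still used, do not transform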

arsenm added inline comments.Feb 19 2022, 6:40 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
639

Should explain this in the comment

tsymalla updated this revision to Diff 410477.Feb 22 2022, 1:33 AM

Adding liveness checks to prevent accidental pattern transformation.
Addressed review comments.

tsymalla marked 2 inline comments as done.Feb 22 2022, 1:34 AM
sebastian-ne added inline comments.Feb 22 2022, 2:25 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
356

Did you remove any of these checks? I still see a lot of isReg and getSubReg checks

362–363

As far as I recall, the stall between v_cmp and s_and_saveexec is quite long, so

v_cmp
s_mov
v_cmpx

is probably faster than

v_cmp
s_and_saveexec

and it’s worth doing this transformation and keeping the v_cmp around.
@foad?

tsymalla added inline comments.Feb 22 2022, 2:43 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
356

Yes, I removed the check around the s_and_saveexec operand. For the v_cmp checks, either src0 or src1 could be a register and shouldn't be a subregister, so I think these checks still make sense.

sebastian-ne added inline comments.Feb 22 2022, 2:56 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
356

Ah, I missed that. I’m still wondering what the getSubReg checks do?
There’s no change on the llvm tests with or without them. I guess a test that hits these code paths would be nice?

foad added inline comments.Feb 22 2022, 3:23 AM
llvm/lib/Target/AMDGPU/GCNHazardRecognizer.cpp
896 ↗(On Diff #410477)

Commit this separately, no review required.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
643

Don't need &* here.

llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
730–731

Typo "write explicitly write"

730–731

Use "DstReg"?

734

"continue" doesn't seem right. We *should* be shrinking cmpx instructions from VOP3 encoding to VOPC encoding.

tsymalla added inline comments.Feb 22 2022, 3:34 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
356

As far as I understood, after register allocation, no subregister indices should be used as we are dealing with physical registers. Post-SSA, there can still be virtual registers. These subregs can be used to get lane bitmasks which represent the parts that are covered by the subregister index if the register is still virtual. So, this should be a check for completeness, but it's probably not necessary here. I have seen some assertions in the codegen where they assume the register does not consist of subregs anymore post-RA (where SIOptimizeExecMaskingPass fits in), so it's probably nice to have, but it would not affect tests either way.

foad added inline comments.Feb 22 2022, 3:48 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
356

Physical register operands don't have subregs and MachineVerifier checks this, so please just remove the checks.

tsymalla marked 5 inline comments as done.Feb 22 2022, 4:21 AM
tsymalla marked 2 inline comments as done.Feb 22 2022, 4:24 AM
tsymalla updated this revision to Diff 410505.Feb 22 2022, 4:25 AM

Fixed shrinking of VOPCX instructions.
Removed superfluous subreg checks.
Addressed review comments.

tsymalla updated this revision to Diff 410510.Feb 22 2022, 4:50 AM

Newline fix.

nhaehnle added inline comments.Feb 22 2022, 5:40 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
362–363

As far as I recall, the stall between v_cmp and s_and_saveexec is quite long, so

v_cmp
s_mov
v_cmpx

is probably faster than

v_cmp
s_and_saveexec

and it’s worth doing this transformation and keeping the v_cmp around.
@foad?

IIRC the VALU->SALU stall is a simple conservative stall based on outstanding SGPR writes from the VALU. The v_cmp -> s_mov in the first snippet would already have the same stall as v_cmp -> s_and_saveexec, whether it uses the result of the v_cmp or not. So it's better to choose the code sequence with s_and_saveexec.

Can somebody please take a look again?

foad added inline comments.Feb 25 2022, 3:36 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3827–3829

Testing MI.getOperand(0).isReg() is ugly here. It happens to work, because operand 0 of a nosdst cmpx e64 instruction is src0_modifiers which is an immediate. But really I think this code ought to say:

if MI has vdst
  copy MI.vdst to Inst32
[else] if MI has sdst
  assert that MI.sdst is vcc (so there is no need to copy it, because the Inst32 will implicitly use vcc)
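
A hedged C++ sketch of that structure (a fragment only; Inst32, TII and VCCReg are assumed to exist in the surrounding shrink code):

// Copy vdst over if the instruction has one; otherwise the 32-bit VOPC form
// implicitly writes vcc, so an explicit sdst must already be vcc.
if (const MachineOperand *VDst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst)) {
  Inst32.add(*VDst);
} else if (const MachineOperand *SDst =
               TII->getNamedOperand(MI, AMDGPU::OpName::sdst)) {
  assert(SDst->getReg() == VCCReg && "expected sdst to be vcc");
}
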
foad added inline comments.Feb 25 2022, 3:46 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
344–345

The code just finds the last v_cmp instruction. Then you hope that it defs the saveexec input operand? Do you ever actually check that?

374

I don't think you need this check (all vcmp instructions have src0 and src1 operands, don't they?).

377

You could have checked for Exec in the first call to findInstrBackwards, just to bail out a bit earlier.

tsymalla marked 4 inline comments as done.Feb 25 2022, 6:15 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
344–345

You are correct. I included this check previously but somehow accidentally deleted it during review changes. Good find. Thank you.

tsymalla updated this revision to Diff 411399.Feb 25 2022, 6:15 AM
tsymalla marked an inline comment as done.

Addressing review comments: adding missing v_cmp dest check.
Other minor improvements.

@Flakebi @foad What are your thoughts on this? It seems like the matching is constrained enough so it will only be applied in certain cases.

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 11:44 PM

Looks good to me

sebastian-ne accepted this revision.Mar 3 2022, 12:35 AM
This revision is now accepted and ready to land.Mar 3 2022, 12:35 AM
tsymalla updated this revision to Diff 412667.Mar 3 2022, 3:57 AM

Resolve merge conflicts.

tsymalla updated this revision to Diff 414619.Mar 11 2022, 2:09 AM

Adding multiple checks to prevent generating incorrect code sequences.

Can you add lit tests for the fixes you made please?

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
325

Typo: an register, an basic block, an specific instruction

I was a bit confused by the comment, maybe calling the function isRegisterLiveAfter(MachineInstr &Stop, MCRegister Reg, …) is more clear? (The MBB can be fetched from Stop.getParent())

I think this function checks if Reg is still live/used after Stop?

333–335

I think a reverse iterator iterates from MBB.rbegin() to MBB.rend() (I’m always confused by this, but it’s used in SIOptimizeExecMaskingPreRA.cpp:412 in this way).

379

Does this also need the TRI?

critson requested changes to this revision.Mar 15 2022, 12:11 AM

Need to fix missing TRI for modifiesRegister calls.
Also please add a test of the form:

$sgpr0 = S_BFE_U32 killed renamable $sgpr3, 524296, implicit-def dead $scc
$vcc = V_CMP_GT_U32_e64 killed $sgpr0, killed $vgpr0, implicit $exec
$sgpr0_sgpr1 = COPY $exec, implicit-def $exec
$sgpr0_sgpr1 = S_AND_B64 killed renamable $sgpr0_sgpr1, killed renamable $vcc, implicit-def dead $scc
$exec = S_MOV_B64_term killed renamable $sgpr0_sgpr1

Where the V_CMP source is a subregister of the saved EXEC mask, which can happen in wave64.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
379

modifiesRegister must be passed TRI otherwise subregisters will not be checked.

384

Here too.

This revision now requires changes to proceed.Mar 15 2022, 12:11 AM
tsymalla marked 5 inline comments as done.Mar 15 2022, 3:56 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
333–335

Good catch. rbegin() == end(), while rend() == begin().

tsymalla updated this revision to Diff 415784.Mar 16 2022, 4:55 AM
tsymalla marked an inline comment as done.

Fixed subregister handling.
Added some more tests.

@critson @sebastian-ne Would you mind taking a look again? Adding the TRI argument should have fixed the remaining issues.

Looks good, thanks, just left some small comments.

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
325

I think the comment is not completely correct.
The function checks if Reg is in use after Stop. If Reg is dead after Stop but still defined and used afterwards, this function will still return false (because Reg is not in use directly after stop).

The fact that the function uses backwards iteration to find the set of live registers can be considered an implementation detail.

333–335

I think the first ++A is wrong, rbegin() should already be a valid iterator?

llvm/test/CodeGen/AMDGPU/vcmp-saveexec-to-vcmpx.ll
134

Nit: There’s a newline missing at the end of the file.

llvm/test/CodeGen/AMDGPU/vcmp-saveexec-to-vcmpx.mir
1 ↗(On Diff #415784)

This test does not look autogenerated?
Can you add a short comment that explains what this test does? (The test and function name contain vcmpx, but the code does not use v_cmpx, which makes it non-obvious.)

tsymalla updated this revision to Diff 415811.Mar 16 2022, 6:48 AM
tsymalla marked 4 inline comments as done.

Addressed review comments.

tsymalla added inline comments.Mar 16 2022, 6:51 AM
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
333–335

I assume this is not exactly wrong, as we don't care about defines at the end of the basic block in this case, but it shouldn't matter either.

@critson Would you mind taking a look again please? Thanks

critson accepted this revision.Mar 20 2022, 7:54 PM

LGTM, with a few minor nits.

Also please double check all comments in the code are up to date, as there have been a lot of revisions.
Thanks!

llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
298

Comment mentions code that no longer exists: DisallowDefBetween?

649

Could this be more future proof?
e.g.
if (AMDGPU::isGFX10Plus(ST) && !ST.hasVcmpxExecWARHazard())

655

It's unfortunate that we have to scan all instructions to find these.
Normally they should be within an instruction or two of the terminators.
However, I can see that the waterfall code violates the principle of EXEC manipulation at the end of block right now, so we don't have a choice.

This revision is now accepted and ready to land.Mar 20 2022, 7:54 PM
tsymalla marked 2 inline comments as done.Mar 21 2022, 1:29 AM
tsymalla added inline comments.
llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp
649

I am not sure if the condition check would be correct, as I don't know the logic behind setting the value returned by hasVcmpxExecWARHazard(). We don't want to have this transformation enabled on either GFX10.1 or GFX10.2 - so for now I'm leaving it as it is - we can change that later perhaps.

tsymalla updated this revision to Diff 416840.Mar 21 2022, 1:30 AM
tsymalla marked an inline comment as done.

Changes based on the review comments (only comments, so NFC).

This revision was landed with ongoing or failed builds.Mar 21 2022, 1:32 AM
This revision was automatically updated to reflect the committed changes.