This is an archive of the discontinued LLVM Phabricator instance.

Avoid predicated execution of the basic blocks containing scalar instructions
ClosedPublic

Authored by alex-t on Sep 26 2017, 2:03 PM.

Details

Summary

V_READFIRSTLANE and V_READLANE produce scalar result from vector operand. Scalar instructions consuming the result should not be executed if exec mask is zero.

Tests passed : make llvm-check

Diff Detail

Event Timeline

alex-t created this revision.Sep 26 2017, 2:03 PM
alex-t edited the summary of this revision. (Show Details)Sep 26 2017, 2:18 PM
rampitec edited edge metadata.Sep 26 2017, 8:59 PM

Test is needed.

lib/Target/AMDGPU/SIInsertSkips.cpp
137

The comment is misleading. Scalar instructions executed even if exec = 0 (contrarily to the comment). That is unclear if there must be a scalar instruction consuming result of readlane too, since SGPR can be an operand of a vector instruction.

alex-t updated this revision to Diff 117363.EditedOct 2 2017, 9:17 AM
  1. Added the code that checks that scalar register produced by V_READFIRSTLANE/V_READLANE

Is really used by some SALU instruction. This is necessary to avoid de-optimization of those cases when the scalar register is distinctly the scalar operand of vector instruction.

  1. MIR test added
rampitec added inline comments.Oct 2 2017, 10:22 AM
lib/Target/AMDGPU/SIInsertSkips.cpp
145

What if an user is actually hoisted out the block? What if that is terminator uses it? In both cases users should then read undefined data.

148

What is interesting, an user can be also another v_readlane_b32's lane select operand. Subsequent v_readlane_b32 would be executed even with exec = 0 and will read an undefined data.

test/CodeGen/AMDGPU/readlane_exec0.mir
1

Please use -check-label=GCN and GCN labels instead of CHECK. This seems to be our new convention.
Also add -verify-machineinstrs.

3

Label in this case needs to be function's name. That is to add new function to the test later.
bb.0 then becomes a regular check.

alex-t updated this revision to Diff 117416.Oct 2 2017, 1:49 PM

It is really make sense to take care of the V_READFIRSTLANE/V_READLANE destination register under exec == 0 condition in case their source VGPR is re-defined in SI_MASK_BRANCH target block. Otherwise we assume that source VGPR is defined in one of the dominating blocks and contain correct value.

rampitec added inline comments.Oct 2 2017, 2:08 PM
lib/Target/AMDGPU/SIInsertSkips.cpp
145

It is named operand "src0", using getOperand(1) is not desirable.
Also VReg is misleading, it reads like "virtual register".

156

Post increment is broken here.

157

Please follow the general and surrounding style: brace on the same line with expression.

175

That is broken. Remember that register is physical.
Not speaking that is whole function scan is expensive.

test/CodeGen/AMDGPU/readlane_exec0.mir
4

You still need to add GCN-LABEL

I would just bail on any of these instructions not trying to optimize the case, just like you did in the beginning.

arsenm added a comment.Oct 2 2017, 4:10 PM

I want to replace this pass by always inserting the branches on execz, and have a new pass which optimizes out short jumps. Would that be easier than trying to analyze this?

alex-t updated this revision to Diff 117544.Oct 3 2017, 10:06 AM

Reverted back to the simplest approach.

This revision is now accepted and ready to land.Oct 3 2017, 10:20 AM
This revision was automatically updated to reflect the committed changes.
alex-t reopened this revision.Oct 17 2017, 12:03 PM
This comment was removed by alex-t.
This revision is now accepted and ready to land.Oct 17 2017, 12:03 PM
alex-t closed this revision.Oct 17 2017, 12:23 PM

r314828