This is an archive of the discontinued LLVM Phabricator instance.

Avoid predicated execution of the basic blocks containing scalar instructions
ClosedPublic

Authored by alex-t on Sep 26 2017, 2:03 PM.

Details

Summary

V_READFIRSTLANE and V_READLANE produce scalar result from vector operand. Scalar instructions consuming the result should not be executed if exec mask is zero.

Tests passed : make llvm-check

Diff Detail

Repository
rL LLVM

Event Timeline

alex-t created this revision.Sep 26 2017, 2:03 PM
alex-t edited the summary of this revision. (Show Details)Sep 26 2017, 2:18 PM
rampitec edited edge metadata.Sep 26 2017, 8:59 PM

Test is needed.

lib/Target/AMDGPU/SIInsertSkips.cpp
137 ↗(On Diff #116708)

The comment is misleading. Scalar instructions executed even if exec = 0 (contrarily to the comment). That is unclear if there must be a scalar instruction consuming result of readlane too, since SGPR can be an operand of a vector instruction.

alex-t updated this revision to Diff 117363.EditedOct 2 2017, 9:17 AM
  1. Added the code that checks that scalar register produced by V_READFIRSTLANE/V_READLANE

Is really used by some SALU instruction. This is necessary to avoid de-optimization of those cases when the scalar register is distinctly the scalar operand of vector instruction.

  1. MIR test added
rampitec added inline comments.Oct 2 2017, 10:22 AM
lib/Target/AMDGPU/SIInsertSkips.cpp
145 ↗(On Diff #117363)

What if an user is actually hoisted out the block? What if that is terminator uses it? In both cases users should then read undefined data.

148 ↗(On Diff #117363)

What is interesting, an user can be also another v_readlane_b32's lane select operand. Subsequent v_readlane_b32 would be executed even with exec = 0 and will read an undefined data.

test/CodeGen/AMDGPU/readlane_exec0.mir
1 ↗(On Diff #117363)

Please use -check-label=GCN and GCN labels instead of CHECK. This seems to be our new convention.
Also add -verify-machineinstrs.

3 ↗(On Diff #117363)

Label in this case needs to be function's name. That is to add new function to the test later.
bb.0 then becomes a regular check.

alex-t updated this revision to Diff 117416.Oct 2 2017, 1:49 PM

It is really make sense to take care of the V_READFIRSTLANE/V_READLANE destination register under exec == 0 condition in case their source VGPR is re-defined in SI_MASK_BRANCH target block. Otherwise we assume that source VGPR is defined in one of the dominating blocks and contain correct value.

rampitec added inline comments.Oct 2 2017, 2:08 PM
lib/Target/AMDGPU/SIInsertSkips.cpp
145 ↗(On Diff #117416)

It is named operand "src0", using getOperand(1) is not desirable.
Also VReg is misleading, it reads like "virtual register".

156 ↗(On Diff #117416)

Post increment is broken here.

157 ↗(On Diff #117416)

Please follow the general and surrounding style: brace on the same line with expression.

175 ↗(On Diff #117416)

That is broken. Remember that register is physical.
Not speaking that is whole function scan is expensive.

test/CodeGen/AMDGPU/readlane_exec0.mir
3 ↗(On Diff #117416)

You still need to add GCN-LABEL

I would just bail on any of these instructions not trying to optimize the case, just like you did in the beginning.

arsenm added a comment.Oct 2 2017, 4:10 PM

I want to replace this pass by always inserting the branches on execz, and have a new pass which optimizes out short jumps. Would that be easier than trying to analyze this?

alex-t updated this revision to Diff 117544.Oct 3 2017, 10:06 AM

Reverted back to the simplest approach.

This revision is now accepted and ready to land.Oct 3 2017, 10:20 AM
This revision was automatically updated to reflect the committed changes.
alex-t reopened this revision.Oct 17 2017, 12:03 PM
This comment was removed by alex-t.
This revision is now accepted and ready to land.Oct 17 2017, 12:03 PM
alex-t closed this revision.Oct 17 2017, 12:23 PM

r314828