This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel][AMDGPU] Add dead code elimination clean up after legalization.
AbandonedPublic

Authored by aemerson on Sep 15 2021, 3:15 PM.

Details

Diff Detail

Event Timeline

aemerson created this revision.Sep 15 2021, 3:15 PM
aemerson requested review of this revision.Sep 15 2021, 3:15 PM
arsenm added inline comments.Sep 15 2021, 4:39 PM
llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
50–53

Why do these instructions not get deleted as-is? I think this is a very un-discoverable option and if it's worth deleting these instructions, it's worth just always doing it

aemerson added inline comments.Sep 15 2021, 4:50 PM
llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
50–53

Maybe if their users have their operands changed to use another def, they might not get readded to the legaliser/arrogant worklist if they’ve already been legalised.

foad added a comment.Sep 16 2021, 2:38 AM

Why do these instructions not get deleted as-is?

Taking test_fcos_v3s16 from legalize-fcos.mir as an example, in the first iteration of legalization, after processing InstList but before processing ArtifactList we have:

bb.0:
  %13:_(<4 x s16>) = G_IMPLICIT_DEF
  %14:_(<4 x s16>) = G_IMPLICIT_DEF
  %15:_(<12 x s16>) = G_CONCAT_VECTORS %13:_(<4 x s16>), %14:_(<4 x s16>), %14:_(<4 x s16>)
  %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
  %3:_(s16), %4:_(s16), %5:_(s16) = G_UNMERGE_VALUES %0:_(<3 x s16>)
  %9:_(s16) = G_FCONSTANT half 0xH3118
  %12:_(s16) = G_FMUL %3:_, %9:_
  %6:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %12:_(s16)
  %11:_(s16) = G_FMUL %4:_, %9:_
  %7:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %11:_(s16)
  %10:_(s16) = G_FMUL %5:_, %9:_
  %8:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %10:_(s16)
  %1:_(<3 x s16>) = G_BUILD_VECTOR %6:_(s16), %7:_(s16), %8:_(s16)
  %2:_(<3 x s32>) = G_ANYEXT %1:_(<3 x s16>)
  S_NOP 0, implicit %2:_(<3 x s32>)

But ArtifactList doesn't seem to be in any particular order because they get processed like this:

Trying to combine: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %15:_(<12 x s16>) = G_CONCAT_VECTORS %13:_(<4 x s16>), %14:_(<4 x s16>), %14:_(<4 x s16>)
.. .. Changed MI: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
CSEInfo::Recording new MI %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %1:_(<3 x s16>) = G_BUILD_VECTOR %6:_(s16), %7:_(s16), %8:_(s16)
.. Not combined, moving to instructions list
Trying to combine: %3:_(s16), %4:_(s16), %5:_(s16) = G_UNMERGE_VALUES %0:_(<3 x s16>)
[...]

%3 does combine, which leaves %0 dead and it gets erased. But %15 does not get revisited after that.

If it tried to combine them in the order %3, %0, %15 it would spot that %15 is trivially dead.

I don't know how this is supposed to work. I can see that the initial InstList is populated so that we legalize in a strict bottom-up order, but that doesn't seem to work for ArtifactList.

foad added a comment.Sep 16 2021, 3:06 AM

D109154 was another attempt at fixing the same kind of problem.

Most common case where we end up with dead instrs in output is unmerge with dead deaf
legalize unmerge using lower(unmerge is in inst list):
Legalizing: %36:_(s16), %37:_(s16) = G_UNMERGE_VALUES %17:_(<2 x s16>)
%37 is dead here, lower creates instrs and artifacts but we are now going through instrlist
.. .. New MI: %49:_(s32) = G_BITCAST %17:_(<2 x s16>)
.. .. New MI: %36:_(s16) = G_TRUNC %49:_(s32)
.. .. New MI: %50:_(s32) = G_LSHR %49:_, %47:_(s32)
.. .. New MI: %37:_(s16) = G_TRUNC %50:_(s32)

Keep legalizing instrs: %50:_(s32) = G_LSHR %49:_, %47:_(s32) not dead, used by trunc, also legal and is now removed from all lists.
Move to artifactlist legalization
get to trunc: dead (%37), erase it, %50:_(s32) = G_LSHR %49:_, %47:_(s32) is now dead but is no longer in any list.
We can't really rely on going through instructions in perfect order.

I find D109154 to be extension of the way we re-insert artifacts into the artifactlist (they are in instlist at that moment) in artifact combiner (while (!UpdatedDefs.empty()) loop ) where we go through all users of UpdatedDefs and retry combines for users that are artifacts.
D109154 similar thing only when instruction from some list gets deleted. It does not seem to me it would significantly affects performance and that we need dce as a flag. If it turns out to be expensive could we get dce by default and turn it off for -O0?

Most common case where we end up with dead instrs in output is unmerge with dead deaf
legalize unmerge using lower(unmerge is in inst list):
Legalizing: %36:_(s16), %37:_(s16) = G_UNMERGE_VALUES %17:_(<2 x s16>)
%37 is dead here, lower creates instrs and artifacts but we are now going through instrlist
.. .. New MI: %49:_(s32) = G_BITCAST %17:_(<2 x s16>)
.. .. New MI: %36:_(s16) = G_TRUNC %49:_(s32)
.. .. New MI: %50:_(s32) = G_LSHR %49:_, %47:_(s32)
.. .. New MI: %37:_(s16) = G_TRUNC %50:_(s32)

Keep legalizing instrs: %50:_(s32) = G_LSHR %49:_, %47:_(s32) not dead, used by trunc, also legal and is now removed from all lists.
Move to artifactlist legalization
get to trunc: dead (%37), erase it, %50:_(s32) = G_LSHR %49:_, %47:_(s32) is now dead but is no longer in any list.
We can't really rely on going through instructions in perfect order.

I find D109154 to be extension of the way we re-insert artifacts into the artifactlist (they are in instlist at that moment) in artifact combiner (while (!UpdatedDefs.empty()) loop ) where we go through all users of UpdatedDefs and retry combines for users that are artifacts.
D109154 similar thing only when instruction from some list gets deleted. It does not seem to me it would significantly affects performance and that we need dce as a flag. If it turns out to be expensive could we get dce by default and turn it off for -O0?

Why don't we just do this clean up unconditionally as a once-over pass? That way we don't have to complicate the logic any further (I'm also not a fan of calling things like MI.uses() just for checking dead instructions.

aemerson updated this revision to Diff 373111.Sep 16 2021, 5:09 PM
aemerson retitled this revision from [GlobalISel][AMDGPU] Add a -final-dce-legalizer flag to clean up dead code after legalization. to [GlobalISel][AMDGPU] Add dead code elimination clean up after legalization..
aemerson edited the summary of this revision. (Show Details)

I've now made this run after legalization unconditionally.

foad added a comment.Sep 17 2021, 1:36 AM

What is the compile time impact of this vs D109154? I'm assuming they are both equally effective at cleaning up the dead code.

If we're going to go with this approach, should we clean up legalizeMachineFunction by ripping out all the bottom-up RPO order stuff, and the on-the-fly (presumably expensive) isTriviallyDead calls?

If we're going to go with this approach, should we clean up legalizeMachineFunction by ripping out all the bottom-up RPO order stuff, and the on-the-fly (presumably expensive) isTriviallyDead calls?

That's worth looking into as a later piece of work. It could plausibly also make things worse if we don't recycle memory that would have otherwise been freed during the legalization process.

What is the compile time impact of this vs D109154? I'm assuming they are both equally effective at cleaning up the dead code.

If we're going to go with this approach, should we clean up legalizeMachineFunction by ripping out all the bottom-up RPO order stuff, and the on-the-fly (presumably expensive) isTriviallyDead calls?

CTMark results for this patch AArch64 using 5 runs:

O0 -g:

Program             base   full-dce diff 
 pairlocalalign       1.56   1.57    1.0%
 kc                   8.70   8.70    0.0%
 bullet              14.03  14.02   -0.1%
 lencod               3.03   3.02   -0.3%
 clamscan             3.81   3.80   -0.3%
 consumer-typeset     2.79   2.78   -0.3%
 tramp3d-v4           5.71   5.69   -0.3%
 SPASS                3.37   3.36   -0.4%
 sqlite3              1.08   1.08   -0.4%
 7zip-benchmark      18.00  17.91   -0.4%
 Geomean difference                 -0.2%

Os -g:

Program             base   full-dce diff 
 bullet              23.96  23.99    0.1%
 lencod              10.03  10.04    0.1%
 tramp3d-v4          16.85  16.84   -0.0%
 7zip-benchmark      30.93  30.90   -0.1%
 kc                  14.95  14.94   -0.1%
 SPASS               11.03  11.01   -0.2%
 clamscan            10.71  10.68   -0.2%
 pairlocalalign       5.80   5.79   -0.3%
 sqlite3              6.75   6.72   -0.4%
 consumer-typeset     8.46   8.39   -0.8%
 Geomean difference                 -0.2%

So we actually see a minor improvement by cleaning up. Not sure why that is.

As for D109154:

-O0 -g:
Program             base   partial-dce diff 
 sqlite3              1.08   1.09       0.1%
 bullet              14.03  14.04       0.1%
 kc                   8.70   8.68      -0.2%
 pairlocalalign       1.56   1.55      -0.2%
 7zip-benchmark      18.00  17.95      -0.3%
 lencod               3.03   3.02      -0.4%
 clamscan             3.81   3.78      -0.6%
 SPASS                3.37   3.35      -0.6%
 tramp3d-v4           5.71   5.67      -0.7%
 consumer-typeset     2.79   2.76      -1.1%
 Geomean difference                    -0.4%

-Os -g:
Program             base   partial-dce diff 
 SPASS               11.03  11.03       0.0%
 bullet              23.96  23.97       0.0%
 sqlite3              6.75   6.75      -0.1%
 7zip-benchmark      30.93  30.90      -0.1%
 tramp3d-v4          16.85  16.82      -0.1%
 consumer-typeset     8.46   8.43      -0.2%
 pairlocalalign       5.80   5.79      -0.3%
 kc                  14.95  14.90      -0.3%
 lencod              10.03   9.99      -0.4%
 clamscan            10.71  10.66      -0.5%
 Geomean difference                    -0.2%

Same thing for D109154, but a slightly bigger improvement at -O0.

I measured the CT when we only delete instructions at the end, and instead of doing an RPOT walk we just process the blocks in sequential order:

O0 -g:
Program             base   final-dce-only diff 
 bullet              14.03  14.06          0.2%
 sqlite3              1.08   1.09          0.1%
 lencod               3.03   3.03          0.1%
 tramp3d-v4           5.71   5.71         -0.0%
 clamscan             3.81   3.80         -0.2%
 kc                   8.70   8.68         -0.3%
 7zip-benchmark      18.00  17.94         -0.3%
 SPASS                3.37   3.36         -0.4%
 pairlocalalign       1.56   1.55         -0.5%
 consumer-typeset     2.79   2.77         -0.7%
 Geomean difference                       -0.2%

-Os -g:
Program             base   final-dce-only diff 
 SPASS               11.03  11.08          0.4%
 sqlite3              6.75   6.77          0.3%
 clamscan            10.71  10.72          0.2%
 pairlocalalign       5.80   5.81          0.1%
 bullet              23.96  23.98          0.1%
 lencod              10.03  10.04          0.0%
 7zip-benchmark      30.93  30.94          0.0%
 tramp3d-v4          16.85  16.81         -0.2%
 kc                  14.95  14.89         -0.4%
 consumer-typeset     8.46   8.41         -0.5%
 Geomean difference                        0.0%

It looks like a wash, a little bit worse than before.

Overall, I think the data is showing that D109154 is most preferable for compile time improvements, especially at -O0 where compile time is more valuable. @Petar.Avramovic let's move ahead with your patch then?

foad added a comment.Sep 20 2021, 1:09 AM

Thanks for all the compile time testing!

aemerson abandoned this revision.Sep 20 2021, 11:17 AM