Details
- Reviewers
arsenm, foad, Petar.Avramovic
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test
---|---
830 ms | x64 debian > libomp.lock::omp_init_lock.c
Event Timeline
llvm/lib/CodeGen/GlobalISel/Legalizer.cpp

Lines | Comment
---|---
50–53 | Why do these instructions not get deleted as-is? I think this is a very un-discoverable option, and if it's worth deleting these instructions, it's worth just always doing it.
llvm/lib/CodeGen/GlobalISel/Legalizer.cpp

Lines | Comment
---|---
50–53 | Maybe if their users have their operands changed to use another def, they might not get re-added to the legalizer/artifact worklist if they've already been legalized.
> Why do these instructions not get deleted as-is?
Taking test_fcos_v3s16 from legalize-fcos.mir as an example, in the first iteration of legalization, after processing InstList but before processing ArtifactList we have:
```
bb.0:
  %13:_(<4 x s16>) = G_IMPLICIT_DEF
  %14:_(<4 x s16>) = G_IMPLICIT_DEF
  %15:_(<12 x s16>) = G_CONCAT_VECTORS %13:_(<4 x s16>), %14:_(<4 x s16>), %14:_(<4 x s16>)
  %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
  %3:_(s16), %4:_(s16), %5:_(s16) = G_UNMERGE_VALUES %0:_(<3 x s16>)
  %9:_(s16) = G_FCONSTANT half 0xH3118
  %12:_(s16) = G_FMUL %3:_, %9:_
  %6:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %12:_(s16)
  %11:_(s16) = G_FMUL %4:_, %9:_
  %7:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %11:_(s16)
  %10:_(s16) = G_FMUL %5:_, %9:_
  %8:_(s16) = G_INTRINSIC intrinsic(@llvm.amdgcn.cos), %10:_(s16)
  %1:_(<3 x s16>) = G_BUILD_VECTOR %6:_(s16), %7:_(s16), %8:_(s16)
  %2:_(<3 x s32>) = G_ANYEXT %1:_(<3 x s16>)
  S_NOP 0, implicit %2:_(<3 x s32>)
```
But the ArtifactList doesn't seem to be in any particular order, because the artifacts get processed like this:
```
Trying to combine: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %15:_(<12 x s16>) = G_CONCAT_VECTORS %13:_(<4 x s16>), %14:_(<4 x s16>), %14:_(<4 x s16>)
.. .. Changed MI: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
CSEInfo::Recording new MI %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %0:_(<3 x s16>), %16:_(<3 x s16>), %17:_(<3 x s16>), %18:_(<3 x s16>) = G_UNMERGE_VALUES %15:_(<12 x s16>)
.. Not combined, moving to instructions list
Trying to combine: %1:_(<3 x s16>) = G_BUILD_VECTOR %6:_(s16), %7:_(s16), %8:_(s16)
.. Not combined, moving to instructions list
Trying to combine: %3:_(s16), %4:_(s16), %5:_(s16) = G_UNMERGE_VALUES %0:_(<3 x s16>)
[...]
```
%3 does combine, which leaves %0 dead and it gets erased. But %15 does not get revisited after that.
If it tried to combine them in the order %3, %0, %15 it would spot that %15 is trivially dead.
I don't know how this is supposed to work. I can see that the initial InstList is populated so that we legalize in a strict bottom-up order, but that doesn't seem to work for ArtifactList.
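For illustration, the kind of cascading cleanup that would catch %15 in the example above could look roughly like this. It is a sketch only, not what this patch implements; the helper name and where it would be called from are hypothetical, and it relies only on the existing isTriviallyDead utility.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/GlobalISel/Utils.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/Register.h"

using namespace llvm;

// Hypothetical helper: erase MI if it is trivially dead, then keep
// re-checking the instructions that defined its source operands, so a chain
// like %0 -> %15 -> %13/%14 is cleaned up even when the worklist order never
// revisits the earlier instructions.
static void eraseTriviallyDeadChain(MachineInstr &MI,
                                    MachineRegisterInfo &MRI) {
  if (!isTriviallyDead(MI, MRI))
    return;

  SmallVector<Register, 8> Worklist;
  auto PushSourceRegs = [&](MachineInstr &I) {
    for (const MachineOperand &MO : I.uses())
      if (MO.isReg() && MO.getReg().isVirtual())
        Worklist.push_back(MO.getReg());
  };

  PushSourceRegs(MI);
  MI.eraseFromParent();

  while (!Worklist.empty()) {
    Register Reg = Worklist.pop_back_val();
    // getVRegDef returns null once the defining instruction has been erased.
    MachineInstr *Def = MRI.getVRegDef(Reg);
    if (!Def || !isTriviallyDead(*Def, MRI))
      continue;
    PushSourceRegs(*Def);
    Def->eraseFromParent();
  }
}
```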
The most common case where we end up with dead instructions in the output is an unmerge with a dead def.

Legalize the unmerge using lower (the unmerge is in the InstList):

```
Legalizing: %36:_(s16), %37:_(s16) = G_UNMERGE_VALUES %17:_(<2 x s16>)
```

%37 is dead here. lower creates new instructions and artifacts, but we are still going through the InstList:

```
.. .. New MI: %49:_(s32) = G_BITCAST %17:_(<2 x s16>)
.. .. New MI: %36:_(s16) = G_TRUNC %49:_(s32)
.. .. New MI: %50:_(s32) = G_LSHR %49:_, %47:_(s32)
.. .. New MI: %37:_(s16) = G_TRUNC %50:_(s32)
```

We keep legalizing instructions: %50:_(s32) = G_LSHR %49:_, %47:_(s32) is not dead (it is used by the trunc), and it is legal, so it is now removed from all lists.

We then move on to ArtifactList legalization.

When we get to the trunc, it is dead (%37), so we erase it. %50:_(s32) = G_LSHR %49:_, %47:_(s32) is now dead, but it is no longer in any list.
We can't really rely on going through instructions in perfect order.
I see D109154 as an extension of the way we re-insert artifacts into the ArtifactList (they are in the InstList at that moment) in the artifact combiner (the while (!UpdatedDefs.empty()) loop), where we go through all users of UpdatedDefs and retry combines for the users that are artifacts.
D109154 does a similar thing, only when an instruction from one of the lists gets deleted. It does not seem to me that it would significantly affect performance, or that we need DCE behind a flag. If it turns out to be expensive, could we enable DCE by default and turn it off for -O0?
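For illustration, that re-queue idea could be sketched like this (this is not the actual D109154 code; the helper name and the worklist parameter are made up):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

using namespace llvm;

// Hypothetical hook, called just before an instruction is erased during
// legalization: queue the instructions defining its source registers so the
// legalizer visits them again and can notice they have become trivially dead.
static void requeueSourceDefs(MachineInstr &Erased, MachineRegisterInfo &MRI,
                              SmallVectorImpl<MachineInstr *> &RetryWorklist) {
  for (const MachineOperand &MO : Erased.uses()) {
    if (!MO.isReg() || !MO.getReg().isVirtual())
      continue;
    if (MachineInstr *Def = MRI.getVRegDef(MO.getReg()))
      RetryWorklist.push_back(Def);
  }
}
```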
Why don't we just do this clean up unconditionally as a once-over pass? That way we don't have to complicate the logic any further. (I'm also not a fan of calling things like MI.uses() just to check for dead instructions.)
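For illustration, such a once-over cleanup could look roughly like the sketch below; the function name and call site are hypothetical. Walking each block bottom-up means a dead user is erased before the instruction feeding it is tested, so an in-block dead chain disappears in a single pass.

```cpp
#include "llvm/CodeGen/GlobalISel/Utils.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

using namespace llvm;

// Sketch of an unconditional post-legalization cleanup: walk every block
// bottom-up and erase anything trivially dead.
static void eraseDeadInstrs(MachineFunction &MF) {
  MachineRegisterInfo &MRI = MF.getRegInfo();
  for (MachineBasicBlock &MBB : MF) {
    for (auto MII = MBB.rbegin(), E = MBB.rend(); MII != E;) {
      MachineInstr &MI = *MII;
      ++MII; // Advance first so erasing MI does not invalidate the iterator.
      if (isTriviallyDead(MI, MRI))
        MI.eraseFromParent();
    }
  }
}
```

A single in-block sweep like this would not catch dead chains that cross block boundaries unless the blocks themselves are visited in a suitable (post-)order, which is where the block-ordering discussion below comes in.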
What is the compile time impact of this vs D109154? I'm assuming they are both equally effective at cleaning up the dead code.
If we're going to go with this approach, should we clean up legalizeMachineFunction by ripping out all the bottom-up RPO order stuff, and the on-the-fly (presumably expensive) isTriviallyDead calls?
> If we're going to go with this approach, should we clean up legalizeMachineFunction by ripping out all the bottom-up RPO order stuff, and the on-the-fly (presumably expensive) isTriviallyDead calls?
That's worth looking into as a later piece of work. It could plausibly also make things worse if we don't recycle memory that would have otherwise been freed during the legalization process.
CTMark results for this patch on AArch64, using 5 runs:
-O0 -g:

Program | base | full-dce | diff
---|---|---|---
pairlocalalign | 1.56 | 1.57 | 1.0%
kc | 8.70 | 8.70 | 0.0%
bullet | 14.03 | 14.02 | -0.1%
lencod | 3.03 | 3.02 | -0.3%
clamscan | 3.81 | 3.80 | -0.3%
consumer-typeset | 2.79 | 2.78 | -0.3%
tramp3d-v4 | 5.71 | 5.69 | -0.3%
SPASS | 3.37 | 3.36 | -0.4%
sqlite3 | 1.08 | 1.08 | -0.4%
7zip-benchmark | 18.00 | 17.91 | -0.4%
Geomean difference | | | -0.2%

-Os -g:

Program | base | full-dce | diff
---|---|---|---
bullet | 23.96 | 23.99 | 0.1%
lencod | 10.03 | 10.04 | 0.1%
tramp3d-v4 | 16.85 | 16.84 | -0.0%
7zip-benchmark | 30.93 | 30.90 | -0.1%
kc | 14.95 | 14.94 | -0.1%
SPASS | 11.03 | 11.01 | -0.2%
clamscan | 10.71 | 10.68 | -0.2%
pairlocalalign | 5.80 | 5.79 | -0.3%
sqlite3 | 6.75 | 6.72 | -0.4%
consumer-typeset | 8.46 | 8.39 | -0.8%
Geomean difference | | | -0.2%
So we actually see a minor improvement by cleaning up. Not sure why that is.
As for D109154:
-O0 -g:

Program | base | partial-dce | diff
---|---|---|---
sqlite3 | 1.08 | 1.09 | 0.1%
bullet | 14.03 | 14.04 | 0.1%
kc | 8.70 | 8.68 | -0.2%
pairlocalalign | 1.56 | 1.55 | -0.2%
7zip-benchmark | 18.00 | 17.95 | -0.3%
lencod | 3.03 | 3.02 | -0.4%
clamscan | 3.81 | 3.78 | -0.6%
SPASS | 3.37 | 3.35 | -0.6%
tramp3d-v4 | 5.71 | 5.67 | -0.7%
consumer-typeset | 2.79 | 2.76 | -1.1%
Geomean difference | | | -0.4%

-Os -g:

Program | base | partial-dce | diff
---|---|---|---
SPASS | 11.03 | 11.03 | 0.0%
bullet | 23.96 | 23.97 | 0.0%
sqlite3 | 6.75 | 6.75 | -0.1%
7zip-benchmark | 30.93 | 30.90 | -0.1%
tramp3d-v4 | 16.85 | 16.82 | -0.1%
consumer-typeset | 8.46 | 8.43 | -0.2%
pairlocalalign | 5.80 | 5.79 | -0.3%
kc | 14.95 | 14.90 | -0.3%
lencod | 10.03 | 9.99 | -0.4%
clamscan | 10.71 | 10.66 | -0.5%
Geomean difference | | | -0.2%
Same thing for D109154, but a slightly bigger improvement at -O0.
I measured the compile time when we only delete instructions at the end and, instead of doing an RPOT walk, just process the blocks in sequential order:
-O0 -g:

Program | base | final-dce-only | diff
---|---|---|---
bullet | 14.03 | 14.06 | 0.2%
sqlite3 | 1.08 | 1.09 | 0.1%
lencod | 3.03 | 3.03 | 0.1%
tramp3d-v4 | 5.71 | 5.71 | -0.0%
clamscan | 3.81 | 3.80 | -0.2%
kc | 8.70 | 8.68 | -0.3%
7zip-benchmark | 18.00 | 17.94 | -0.3%
SPASS | 3.37 | 3.36 | -0.4%
pairlocalalign | 1.56 | 1.55 | -0.5%
consumer-typeset | 2.79 | 2.77 | -0.7%
Geomean difference | | | -0.2%

-Os -g:

Program | base | final-dce-only | diff
---|---|---|---
SPASS | 11.03 | 11.08 | 0.4%
sqlite3 | 6.75 | 6.77 | 0.3%
clamscan | 10.71 | 10.72 | 0.2%
pairlocalalign | 5.80 | 5.81 | 0.1%
bullet | 23.96 | 23.98 | 0.1%
lencod | 10.03 | 10.04 | 0.0%
7zip-benchmark | 30.93 | 30.94 | 0.0%
tramp3d-v4 | 16.85 | 16.81 | -0.2%
kc | 14.95 | 14.89 | -0.4%
consumer-typeset | 8.46 | 8.41 | -0.5%
Geomean difference | | | 0.0%
It looks like a wash, a little bit worse than before.
Overall, I think the data shows that D109154 is the preferable option for compile-time improvements, especially at -O0 where compile time matters most. @Petar.Avramovic, let's move ahead with your patch then?