Page MenuHomePhabricator

[CodeGenPrepare] The instruction to be sunk should be inserted before its user in a block
ClosedPublic

Authored by TiehuZhang on Aug 2 2021, 4:09 AM.

Details

Summary

In current implementation, the instruction to be sunk will be inserted before the target instruction without considering the def-use tree, which may case Instruction does not dominate all uses error. We need to choose a suitable location to insert
according to the use chain

Diff Detail

Event Timeline

TiehuZhang created this revision.Aug 2 2021, 4:09 AM
TiehuZhang requested review of this revision.Aug 2 2021, 4:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2021, 4:09 AM
TiehuZhang edited the summary of this revision. (Show Details)Aug 2 2021, 4:13 AM
dmgreen added a subscriber: dmgreen.Aug 2 2021, 4:46 AM

This happens because out of the chain of insert; shuffle only the insert needs to be sunk?
It may be better to remove the UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) and update InsertPoint in the loop as we go. It won't need to clone the instruction, but it can update InsertPoint and continue.

llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll
4–5 ↗(On Diff #363430)

This can be removed.

25 ↗(On Diff #363430)

Bugpoint has a habit of over-reducing test cases, introducing undef where they only make things worse and less maintainable in the longrun. Can you make the test not use any undef?

This happens because out of the chain of insert; shuffle only the insert needs to be sunk?
It may be better to remove the UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) and update InsertPoint in the loop as we go. It won't need to clone the instruction, but it can update InsertPoint and continue.

Hi, dmgreen, thanks for your review! But I don't quite make sense of your comment. The condition UI->getParent() == TargetBB is used to filter out some instructions already in targetBB to avoid the meaningless sinking, why can it be removed?

TiehuZhang updated this revision to Diff 363675.Aug 3 2021, 3:27 AM

This happens because out of the chain of insert; shuffle only the insert needs to be sunk?
It may be better to remove the UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) and update InsertPoint in the loop as we go. It won't need to clone the instruction, but it can update InsertPoint and continue.

Hi, dmgreen, thanks for your review! But I don't quite make sense of your comment. The condition UI->getParent() == TargetBB is used to filter out some instructions already in targetBB to avoid the meaningless sinking, why can it be removed?

shouldSinkOperands will return a chain of instructions, in this case starting at the mul it will return the shuffle and the insert, as those are the instruction it is profitable to sink. They are visited in reverse order in order to sink the last instruction first. The shuffle is already in the TargetBB, so doesn't need to be sunk (or cloned), but we still need to update the InsertPoint when sinking the insert, or else it will fail dominance checks.

It looks like there are some Arm tests failing with the current attempt to fix that. The instruction being sunk may have multiple uses, and moving before the first one we happen to find might not always work, it might not be the instruction that was originally returned by shouldSinkOperands.

Instead, I think it would be better to make sure we are updating InsertPoint as we go in the ToReplace loop. So we add instruction to ToReplace even if they are already in TargetDB, and in the ToReplace loop we check if the instruction is already in the correct BB and just update the InsertPoint if so.

llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll
7 ↗(On Diff #363675)

Can this test be added to one of the existing files? There is already a sink-free-instructions.ll test. Otherwise the test should preferably have a better name than sink-free-instructions-1.ll.

29 ↗(On Diff #363675)

This value is dead? Can it be returned instead?

TiehuZhang updated this revision to Diff 364461.Aug 5 2021, 7:30 AM
TiehuZhang edited the summary of this revision. (Show Details)
TiehuZhang updated this revision to Diff 364464.Aug 5 2021, 7:35 AM

This happens because out of the chain of insert; shuffle only the insert needs to be sunk?
It may be better to remove the UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) and update InsertPoint in the loop as we go. It won't need to clone the instruction, but it can update InsertPoint and continue.

Hi, dmgreen, thanks for your review! But I don't quite make sense of your comment. The condition UI->getParent() == TargetBB is used to filter out some instructions already in targetBB to avoid the meaningless sinking, why can it be removed?

shouldSinkOperands will return a chain of instructions, in this case starting at the mul it will return the shuffle and the insert, as those are the instruction it is profitable to sink. They are visited in reverse order in order to sink the last instruction first. The shuffle is already in the TargetBB, so doesn't need to be sunk (or cloned), but we still need to update the InsertPoint when sinking the insert, or else it will fail dominance checks.

It looks like there are some Arm tests failing with the current attempt to fix that. The instruction being sunk may have multiple uses, and moving before the first one we happen to find might not always work, it might not be the instruction that was originally returned by shouldSinkOperands.

Instead, I think it would be better to make sure we are updating InsertPoint as we go in the ToReplace loop. So we add instruction to ToReplace even if they are already in TargetDB, and in the ToReplace loop we check if the instruction is already in the correct BB and just update the InsertPoint if so.

@dmgreen, thanks very much! We can make use of condition UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) to update the insertPoint. Actually, the order of users obtained from users() may not follow instructions order in a basic block. So the previous patch was not correct.

llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll
25 ↗(On Diff #363430)

Thanks for you comment. The case is reduced from a csmith testcase, so there are some strange instructions. I've made some changes to remove the undef.

TiehuZhang updated this revision to Diff 364468.Aug 5 2021, 7:52 AM
TiehuZhang updated this revision to Diff 364473.Aug 5 2021, 7:55 AM

Thanks, looking good. But I do still worry about the order of instructions sunk.

I was trying it out, seeing if it would go wrong when we were sinking a lot of operands. I noticed that the add/sub sinking wasn't really working properly though! There is https://reviews.llvm.org/D107623 to improve that and getting the shuffles to sink.

With that in, can you add these two test to show partially sinking two values at the same time:

define <4 x i32> @sinkadd_partial(<8 x i16> %a1, <8 x i16> %a2, i8 %f) {
for.cond4.preheader.lr.ph:
  %cmp = icmp slt i8 %f, 0
  %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader

for.cond4.preheader.us.preheader:                 ; preds = %for.cond4.preheader.lr.ph
  %e1 = sext <4 x i16> %s1 to <4 x i32>
  %e2 = sext <4 x i16> %s2 to <4 x i32>
  %0 = add <4 x i32> %e1, %e2
  ret <4 x i32> %0

for.cond4.preheader.preheader:                    ; preds = %for.cond4.preheader.lr.ph
  ret <4 x i32> zeroinitializer
}

define <4 x i32> @sinkadd_partial_rev(<8 x i16> %a1, <8 x i16> %a2, i8 %f) {
for.cond4.preheader.lr.ph:
  %cmp = icmp slt i8 %f, 0
  %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader

for.cond4.preheader.us.preheader:                 ; preds = %for.cond4.preheader.lr.ph
  %e2 = sext <4 x i16> %s2 to <4 x i32>
  %e1 = sext <4 x i16> %s1 to <4 x i32>
  %0 = add <4 x i32> %e1, %e2
  ret <4 x i32> %0

for.cond4.preheader.preheader:                    ; preds = %for.cond4.preheader.lr.ph
  ret <4 x i32> zeroinitializer
}

The order of extends in the target block become important.

Thanks, looking good. But I do still worry about the order of instructions sunk.

I was trying it out, seeing if it would go wrong when we were sinking a lot of operands. I noticed that the add/sub sinking wasn't really working properly though! There is https://reviews.llvm.org/D107623 to improve that and getting the shuffles to sink.

With that in, can you add these two test to show partially sinking two values at the same time:

define <4 x i32> @sinkadd_partial(<8 x i16> %a1, <8 x i16> %a2, i8 %f) {
for.cond4.preheader.lr.ph:
  %cmp = icmp slt i8 %f, 0
  %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader

for.cond4.preheader.us.preheader:                 ; preds = %for.cond4.preheader.lr.ph
  %e1 = sext <4 x i16> %s1 to <4 x i32>
  %e2 = sext <4 x i16> %s2 to <4 x i32>
  %0 = add <4 x i32> %e1, %e2
  ret <4 x i32> %0

for.cond4.preheader.preheader:                    ; preds = %for.cond4.preheader.lr.ph
  ret <4 x i32> zeroinitializer
}

define <4 x i32> @sinkadd_partial_rev(<8 x i16> %a1, <8 x i16> %a2, i8 %f) {
for.cond4.preheader.lr.ph:
  %cmp = icmp slt i8 %f, 0
  %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader

for.cond4.preheader.us.preheader:                 ; preds = %for.cond4.preheader.lr.ph
  %e2 = sext <4 x i16> %s2 to <4 x i32>
  %e1 = sext <4 x i16> %s1 to <4 x i32>
  %0 = add <4 x i32> %e1, %e2
  ret <4 x i32> %0

for.cond4.preheader.preheader:                    ; preds = %for.cond4.preheader.lr.ph
  ret <4 x i32> zeroinitializer
}

The order of extends in the target block become important.

OK, I will, thanks!

TiehuZhang updated this revision to Diff 365144.Aug 9 2021, 4:00 AM

I've committed 013030a0b213a75e0403fcdb5a070d21831ee561. Do you want to rebase and fix those extra testcases here? Or do you think that's best left for a separate patch?

I've committed 013030a0b213a75e0403fcdb5a070d21831ee561. Do you want to rebase and fix those extra testcases here? Or do you think that's best left for a separate patch?

Oh, I'll rebase to the latest version and update my patch

Hi, @dmgreen, the previous implementation doesn't take the order of extends into account, so Instruction does not dominate all uses error will still appear in one of the testcases you mentioned. I have updated the patch to fix the failed case. Could you please check whether this modification is appropriate? Thanks very much.

dmgreen accepted this revision.Aug 12 2021, 9:17 AM

Yeah, this sounds good to me. I might have used std::distance(TargetBB->begin(), UI) < std::distance(TargetBB->begin(), InsertPt) myself, as there will only even be a few sinking nodes and they may not be in the TargetDB, but your way with collecting the block sounds good too.

LGTM

llvm/lib/CodeGen/CodeGenPrepare.cpp
6954

I would either use uint32_t or uint64_t to be specific. I don't think there could be more than 2^32 instructions in a block.

This revision is now accepted and ready to land.Aug 12 2021, 9:17 AM

Yeah, this sounds good to me. I might have used std::distance(TargetBB->begin(), UI) < std::distance(TargetBB->begin(), InsertPt) myself, as there will only even be a few sinking nodes and they may not be in the TargetDB, but your way with collecting the block sounds good too.

LGTM

Why not ‘comesBefore’ ?