In current implementation, the instruction to be sunk will be inserted before the target instruction without considering the def-use tree, which may case Instruction does not dominate all uses error. We need to choose a suitable location to insert
according to the use chain
Details
Diff Detail
Event Timeline
This happens because out of the chain of insert; shuffle only the insert needs to be sunk?
It may be better to remove the UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) and update InsertPoint in the loop as we go. It won't need to clone the instruction, but it can update InsertPoint and continue.
llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll | ||
---|---|---|
4–5 ↗ | (On Diff #363430) | This can be removed. |
25 ↗ | (On Diff #363430) | Bugpoint has a habit of over-reducing test cases, introducing undef where they only make things worse and less maintainable in the longrun. Can you make the test not use any undef? |
Hi, dmgreen, thanks for your review! But I don't quite make sense of your comment. The condition UI->getParent() == TargetBB is used to filter out some instructions already in targetBB to avoid the meaningless sinking, why can it be removed?
shouldSinkOperands will return a chain of instructions, in this case starting at the mul it will return the shuffle and the insert, as those are the instruction it is profitable to sink. They are visited in reverse order in order to sink the last instruction first. The shuffle is already in the TargetBB, so doesn't need to be sunk (or cloned), but we still need to update the InsertPoint when sinking the insert, or else it will fail dominance checks.
It looks like there are some Arm tests failing with the current attempt to fix that. The instruction being sunk may have multiple uses, and moving before the first one we happen to find might not always work, it might not be the instruction that was originally returned by shouldSinkOperands.
Instead, I think it would be better to make sure we are updating InsertPoint as we go in the ToReplace loop. So we add instruction to ToReplace even if they are already in TargetDB, and in the ToReplace loop we check if the instruction is already in the correct BB and just update the InsertPoint if so.
llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll | ||
---|---|---|
7 ↗ | (On Diff #363675) | Can this test be added to one of the existing files? There is already a sink-free-instructions.ll test. Otherwise the test should preferably have a better name than sink-free-instructions-1.ll. |
29 ↗ | (On Diff #363675) | This value is dead? Can it be returned instead? |
@dmgreen, thanks very much! We can make use of condition UI->getParent() == TargetBB from if (UI->getParent() == TargetBB || isa<PHINode>(UI)) to update the insertPoint. Actually, the order of users obtained from users() may not follow instructions order in a basic block. So the previous patch was not correct.
llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions-1.ll | ||
---|---|---|
25 ↗ | (On Diff #363430) | Thanks for you comment. The case is reduced from a csmith testcase, so there are some strange instructions. I've made some changes to remove the undef. |
Thanks, looking good. But I do still worry about the order of instructions sunk.
I was trying it out, seeing if it would go wrong when we were sinking a lot of operands. I noticed that the add/sub sinking wasn't really working properly though! There is https://reviews.llvm.org/D107623 to improve that and getting the shuffles to sink.
With that in, can you add these two test to show partially sinking two values at the same time:
define <4 x i32> @sinkadd_partial(<8 x i16> %a1, <8 x i16> %a2, i8 %f) { for.cond4.preheader.lr.ph: %cmp = icmp slt i8 %f, 0 %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7> %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7> br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader for.cond4.preheader.us.preheader: ; preds = %for.cond4.preheader.lr.ph %e1 = sext <4 x i16> %s1 to <4 x i32> %e2 = sext <4 x i16> %s2 to <4 x i32> %0 = add <4 x i32> %e1, %e2 ret <4 x i32> %0 for.cond4.preheader.preheader: ; preds = %for.cond4.preheader.lr.ph ret <4 x i32> zeroinitializer } define <4 x i32> @sinkadd_partial_rev(<8 x i16> %a1, <8 x i16> %a2, i8 %f) { for.cond4.preheader.lr.ph: %cmp = icmp slt i8 %f, 0 %s2 = shufflevector <8 x i16> %a2, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7> %s1 = shufflevector <8 x i16> %a1, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7> br i1 %cmp, label %for.cond4.preheader.us.preheader, label %for.cond4.preheader.preheader for.cond4.preheader.us.preheader: ; preds = %for.cond4.preheader.lr.ph %e2 = sext <4 x i16> %s2 to <4 x i32> %e1 = sext <4 x i16> %s1 to <4 x i32> %0 = add <4 x i32> %e1, %e2 ret <4 x i32> %0 for.cond4.preheader.preheader: ; preds = %for.cond4.preheader.lr.ph ret <4 x i32> zeroinitializer }
The order of extends in the target block become important.
I've committed 013030a0b213a75e0403fcdb5a070d21831ee561. Do you want to rebase and fix those extra testcases here? Or do you think that's best left for a separate patch?
Hi, @dmgreen, the previous implementation doesn't take the order of extends into account, so Instruction does not dominate all uses error will still appear in one of the testcases you mentioned. I have updated the patch to fix the failed case. Could you please check whether this modification is appropriate? Thanks very much.
Yeah, this sounds good to me. I might have used std::distance(TargetBB->begin(), UI) < std::distance(TargetBB->begin(), InsertPt) myself, as there will only even be a few sinking nodes and they may not be in the TargetDB, but your way with collecting the block sounds good too.
LGTM
llvm/lib/CodeGen/CodeGenPrepare.cpp | ||
---|---|---|
6954 | I would either use uint32_t or uint64_t to be specific. I don't think there could be more than 2^32 instructions in a block. |
I would either use uint32_t or uint64_t to be specific. I don't think there could be more than 2^32 instructions in a block.