This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Avoid postponing rewriting load/store by ignoring lifetime intrinsics in partition's promotability checking
ClosedPublic

Authored by slydiman on May 4 2022, 2:50 PM.

Details

Summary

Performance drift caused by generated ptx between slight different code calling flow.
This patch fixes a bug that generates unnecessary packing/unpacking structure code because of incorrectly handling lifetime intrinsic.

For example, a partition of an alloca may contain many slices:

Partition [0, 4):
  Slice0: [0, 4) used by: load i32 addr;
  Slice1: [0, 4) used by: store i32 v, addr;
  Slice2: [0, 16) used by lifetime.start(16, addr);

When SROA determines if the partition can be promoted, lifetime.start is currently treated as a whole alloca load/store, so Slice0 and Slice1 cannot be promoted at this attempt, but the packing/unpacking code for Slice0 and Slice1 has been generated. After rewrite lifetime.start/end intrinsic, SROA tries again with Slice0 and Slice1 and finally promotes them, but redundant packing/unpacking code remaining in the IRs.

This patch changes promotability checking to ignore lifetime intrinsic (they will be rewritten to correct sizes later), so we can promote the real users (load/store) at the first attempt with optimal code.

Diff Detail

Event Timeline

slydiman created this revision.May 4 2022, 2:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 2:50 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
slydiman requested review of this revision.May 4 2022, 2:50 PM
tra added a subscriber: tra.May 4 2022, 3:54 PM
tra added inline comments.
llvm/test/Transforms/SROA/lifetime-intrinsic.ll
1

I'd add a comment describing what is it we're looking for here.
It's not clear from the test itself why we're checking for those phys. To make sure that we don't do excessive shifting and masking of i32->i16->i32 ?

26

These labels look overly specific. You should probably capture the actual names where the values are set.

nikic added inline comments.May 5 2022, 12:54 AM
llvm/test/Transforms/SROA/lifetime-intrinsic.ll
1

Generally, please reduce this test to the minimal instructions that show the problem.

slydiman updated this revision to Diff 427504.May 5 2022, 6:40 PM

I have updated the test. Now it is generic, not NVPTX specific.

slydiman marked an inline comment as done.May 5 2022, 6:41 PM
nikic added inline comments.May 6 2022, 1:40 AM
llvm/lib/Transforms/Scalar/SROA.cpp
1990

You can use isLifetimeStartOrEnd() here.

llvm/test/Transforms/SROA/lifetime-intrinsic.ll
1

Please use update_test_checks.py to generate test checks.

21

Is the loop required to demonstrate the problem?

nikic added a comment.May 6 2022, 1:50 AM

If I take your example with and without lifetime intrinsics, then with current SROA, the variant without intrinsics produces a worse result: https://llvm.godbolt.org/z/oPMKWzaMq

Results of SROA should be the same with an without lifetime intrinsics, but I'm not sure your patch achieves that -- it looks like it produces a third variant that is better than both?

slydiman added a comment.EditedMay 6 2022, 7:22 AM

If I take your example with and without lifetime intrinsics, then with current SROA, the variant without intrinsics produces a worse result: https://llvm.godbolt.org/z/oPMKWzaMq
Results of SROA should be the same with an without lifetime intrinsics, but I'm not sure your patch achieves that -- it looks like it produces a third variant that is better than both?

Here is the result after this patch. Compare the while_body section.

define i16 @foo(i32* nocapture readonly %loop) #1 {
entry:
  br label %while_cond

while_cond:                                       ; preds = %while_body, %entry
  %arr.sroa.6.0 = phi i32 [ 0, %entry ], [ %res1, %while_body ]
  %arr.sroa.0.0 = phi i32 [ 0, %entry ], [ %res0, %while_body ]
  %loopi = load i32, i32* %loop, align 4
  %loopb = icmp eq i32 %loopi, 0
  br i1 %loopb, label %while_end, label %while_body

while_body:                                       ; preds = %while_cond
  %x = call { i32, i32 } @bar(i32 %arr.sroa.0.0, i32 %arr.sroa.6.0) #0
  %res0 = extractvalue { i32, i32 } %x, 0
  %res1 = extractvalue { i32, i32 } %x, 1
  br label %while_cond

while_end:                                        ; preds = %while_cond
  %arr.sroa.0.0.extract.trunc = trunc i32 %arr.sroa.0.0 to i16
  %arr.sroa.6.4.extract.trunc = trunc i32 %arr.sroa.6.0 to i16
  %ret = add i16 %arr.sroa.0.0.extract.trunc, %arr.sroa.6.4.extract.trunc
  ret i16 %ret
}
slydiman updated this revision to Diff 427652.May 6 2022, 8:51 AM
slydiman marked 3 inline comments as done.

I have updated the code and removed the loop from the test.

nikic added inline comments.May 6 2022, 9:15 AM
llvm/test/Transforms/SROA/lifetime-intrinsic.ll
12

Can you please add a copy of this function that does not use lifetime.start/end, precommit the test, and then rebase this revision so it only shows the test diff?

slydiman updated this revision to Diff 427666.May 6 2022, 9:54 AM
slydiman marked an inline comment as done.
nikic added a comment.May 13 2022, 8:39 AM

This still has the problem that the result differs depending on whether the lifetime intrinsic is present or not. It makes me suspicious that this is not fixing the right thing. I would at least like an explanation of where the difference comes from.

Why do you think the results of SROA should be the same with an without lifetime intrinsics?

The precommited tests, you asked for above, contradict this statement, and demonstrate that lifetime intrinsics help with the optimization.

The problem with the current implementation is that the first element is not handled properly, fixing which is the subject of this patch.

nikic added a comment.May 16 2022, 5:45 AM

I've done some due diligence, and the relevant difference starts with this code: https://github.com/llvm/llvm-project/blob/1ddc6ab1a9c321e02da4fab836c5f863a906108b/llvm/lib/Transforms/Scalar/SROA.cpp#L4412 If lifetime intrinsics are present, we'll mark both the [0,2) and [0,4) slices as unsplittable, and will then visit the [0,4) partition first, which has viable integer widening. Without lifetime intrinsics, we'll mark only [0,2) as unsplittable, and visit partition [0,2) first, where widening is not viable. I think something is very wrong here, but I'm not confident in which place it is and don't want to spend a lot of time on this.

In any case, I think your patch makes sense on a conceptual level, so let's go with it for now.

llvm/lib/Transforms/Scalar/SROA.cpp
1993

These already is a check for isLifetimeStartOrEnd() and isDroppable() below, only difference is that it's after the RelEnd > Size check. I think you'd want to add the isDroppable() check here as well, and then drop the IntrinsicInst handling below entirely.

You can also move the Use *U = S.getUse(); line before here.

slydiman updated this revision to Diff 429817.May 16 2022, 12:41 PM
slydiman added inline comments.May 16 2022, 12:45 PM
llvm/lib/Transforms/Scalar/SROA.cpp
1993

I have moved Use *U = S.getUse();.
Please note I cannot move isDroppable() bacause this block depends on the previous block

} else if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(U->getUser())) {

and such moving will break the current logic.

nikic added inline comments.May 16 2022, 12:57 PM
llvm/lib/Transforms/Scalar/SROA.cpp
1993

Do you mean in case a MemIntrinsic is droppable? This cannot happen ("droppable" here means "assume intrinsic" or similar).

slydiman added inline comments.May 16 2022, 1:08 PM
llvm/lib/Transforms/Scalar/SROA.cpp
1993

I can move the following code to the beginning

if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(U->getUser())) {
  if (II->isLifetimeStartOrEnd()) {
    return true;
  } else if (!II->isDroppable()) {
    return false;
  }
}

The returned value false here may be wrong because the following block may do not return anything, prevent executing the block if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(U->getUser())) below and finally the return value must be true

} else if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(U->getUser())) {
  if (MI->isVolatile() || !isa<Constant>(MI->getLength()))
    return false;
  if (!S.isSplittable())
    return false; // Skip any unsplittable intrinsics.
}

Actually such moving breakes a lot of tests.

nikic added inline comments.May 16 2022, 1:33 PM
llvm/lib/Transforms/Scalar/SROA.cpp
1993

To be clear, this is what I meant:

if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(U->getUser())) {
  if (II->isLifetimeStartOrEnd() || II->isDroppable())
    return true;
}

You are returning false for all non-droppable intrinsics, while we only want to return true for droppable intrinsics, and leave everything else to go through the code below (including MemIntrinsic).

slydiman updated this revision to Diff 429845.May 16 2022, 2:13 PM
slydiman marked an inline comment as done.
nikic accepted this revision.May 16 2022, 2:49 PM

LGTM

This revision is now accepted and ready to land.May 16 2022, 2:49 PM
This revision was landed with ongoing or failed builds.May 17 2022, 2:26 AM
This revision was automatically updated to reflect the committed changes.