Page MenuHomePhabricator

[CodeGenPrepare][WIP] Convert uncond. branch to return into a return to help with shrink-wrapping
AbandonedPublic

Authored by sfertile on Aug 8 2017, 9:55 PM.

Details

Summary

Currently, CGP will get rid of mostly empty blocks by merging them with their successors. However, in a case where a block BB simply branches unconditionally to a block that has no code other than a return, it can be helpful if BB just returned since shrink-wrapping might have another candidate for the epilogue.

Motivating test case:

; Function Attrs: noinline nounwind readnone
define signext i32 @callee(i32 signext %a, i32 signext %lim) local_unnamed_addr #0 {
entry:
  %cmp5 = icmp sgt i32 %lim, 0
  br i1 %cmp5, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.body, %entry
  %Ret.0.lcssa = phi i32 [ 0, %entry ], [ %0, %for.body ]
  ret i32 %Ret.0.lcssa

for.body:                                         ; preds = %for.body.preheader, %for.body
  %i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
  %Ret.06 = phi i32 [ %0, %for.body ], [ 0, %for.body.preheader ]
  %0 = tail call i32 asm "add $0, $1, $2", "=r,r,r,~{r14},~{r15},~{r16},~{r17},~{r18},~{r19},~{r20},~{r21},~{r22},~{r23},~{r24},~{r25},~{r26},~{r27},~{r28},~{r29},~{r30},~{r31}"(i32 %a, i32 %Ret.06)
  %inc = add nuw nsw i32 %i.07, 1
  %exitcond = icmp eq i32 %inc, %lim
  br i1 %exitcond, label %for.cond.cleanup, label %for.body

IR into Codegen Prepare:

; Function Attrs: noinline nounwind readnone
define signext i32 @caller(i32 signext %a, i32 signext %lim) local_unnamed_addr #0 {
entry:
  %cmp5 = icmp sgt i32 %lim, 0
  br i1 %cmp5, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.body
  %.lcssa = phi i32 [ %0, %for.body ]
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  %Ret.0.lcssa = phi i32 [ 0, %entry ], [ %.lcssa, %for.cond.cleanup.loopexit ]
  ret i32 %Ret.0.lcssa

for.body:                                         ; preds = %for.body, %for.body.preheader
  %lsr.iv = phi i32 [ %lsr.iv.next, %for.body ], [ %lim, %for.body.preheader ]
  %Ret.06 = phi i32 [ %0, %for.body ], [ 0, %for.body.preheader ]
  %0 = tail call i32 asm "add $0, $1, $2", "=r,r,r,~{r14},~{r15},~{r16},~{r17},~{r18},~{r19},~{r20},~{r21},~{r22},~{r23},~{r24},~{r25},~{r26},~{r27},~{r28},~{r29},~{r30},~{r31}"(i32 %a, i32 %Ret.06) #3, !srcloc !3
  %lsr.iv.next = add i32 %lsr.iv, -1
  %exitcond = icmp eq i32 %lsr.iv.next, 0
  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
}

Block for.cond.cleanup.loopexit can be used for the epilogue if flow has gone through for.body whereas block for.cond.cleanup can be used for the epilogue if flow never went into the loop.

The patch certainly increases the amount of shrink-wrapping we do. I'm currently running spec on PPC and will post the net change once the results are ready.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai created this revision.Aug 8 2017, 9:55 PM

As one might expect, this causes a fair number of lit test cases to fail (65 to be more precise). Rather than sinking a whole bunch of time initially on updating them and producing a very large patch, I wanted to get a sense of whether the reviewers are interested in proceeding with this patch first.

So if we do want this (and we do want it to trigger for all targets), I'll update the test cases upon hearing so.

lib/CodeGen/CodeGenPrepare.cpp
332

I initially meant to add a target hook to see if this is desired by the target, but decided against it as I don't see an issue with always doing so. Depending on the review comments, I can either remove this comment or re-add the target hook.

nemanjai retitled this revision from [CodeGenPrepare] Convert uncond. branch to return into a return to help with shrink-wrapping to [CodeGenPrepare][WIP] Convert uncond. branch to return into a return to help with shrink-wrapping.Aug 8 2017, 10:34 PM
junbuml added a subscriber: junbuml.Aug 9 2017, 7:44 AM
mcrosier added a subscriber: bmakam.Aug 9 2017, 8:09 AM

Just some stats (on PPC):

  • 3.8% more functions shrink-wrapped in the bootstrap build
  • 14.3% more functions shrink-wrapped in SPEC

Performance wise SPEC-FP shows no significant change.
SPEC-INT shows a 7% improvement in Xalancbmx, 1% improvement in Sjeng, slightly under 5% degradation in h264ref and sub-1% improvements on everything else.

  • Overall, I agree with the idea. I didn't see any performance differences in Xalancbmk and h264ref on AArach64 though. I guess this will not leave a branch instruction in the empty block as it directly return.
  • Should the predecessor branching to an empty return block also be empty? Your implementation doesn't seem to check if the predecessor is empty.
  • Just curious about the -5% in h264ref on PPC?
lib/CodeGen/CodeGenPrepare.cpp
653

Isn't it possible to do this in the main loop of eliminateMostlyEmptyBlocks()?

2625–2627

It seems that you didn't check if the predecessor is empty? Is this what you intended?

2639–2640

I guess you might reduce DEBUGs in this function. Right?

test/CodeGen/MIR/Generic/early-ret-bb.mir
1

Why do you use .mir, instead of .ll?

  • Overall, I agree with the idea. I didn't see any performance differences in Xalancbmk and h264ref on AArach64 though. I guess this will not leave a branch instruction in the empty block as it directly return.
  • Should the predecessor branching to an empty return block also be empty? Your implementation doesn't seem to check if the predecessor is empty.
  • Just curious about the -5% in h264ref on PPC?

Do you see any performance difference in any of your benchmarks (i.e. other than Xalan and h264ref)?
I haven't investigated the degradation on h264ref yet, but I do plan to look into it before making further updates to this patch. I figured it'd be nice to hear from other targets as well about whether this has any impact.

lib/CodeGen/CodeGenPrepare.cpp
653

I initially had it there but removed it for two reasons (neither of which I feel too strongly about):

  1. It just didn't feel like it should go there since I'm doing an inherently different thing
  2. That loop skips the entry block and I didn't see a compelling reason to do so for this. Of course, I could just have one invocation before the loop but this looked cleaner.

As I said, I'm not opposed to putting it back.

2625–2627

I initially had a check that the predecessor has a single instruction (i.e. the branch). But as it turns out, the predecessor will actually have a PHI as well in the motivating test case (albeit a PHI with a single entry).

Then I got to thinking about whether there's a compelling reason for the predecessor to have to be empty. And I couldn't think of one.
I suppose we can check if it:

  • has a single instruction
  • has a single non-PHI instruction
  • has a single instruction or a single predecessor

Not sure which of those would be the most appropriate (I am leaning toward the middle one).

2639–2640

Yeah. It was nice during development to have all of that. And I figured I'd leave it here while this patch is WIP for anyone that wants to play with the patch. But if this becomes a real patch, I'll reduce the DEBUG messages.

test/CodeGen/MIR/Generic/early-ret-bb.mir
1

I do have a .ll test case as well. I wanted to also provide a .mir test case to test the actual transformation in CGP rather than just its effect on the final ASM.

Do you see any performance difference in any of your benchmarks (i.e. other than Xalan and h264ref)?

I try to test this on AArch64 and share the result.

lib/CodeGen/CodeGenPrepare.cpp
2625–2627

Based on your summary and function name (dupRetFedByEmptyBlocks), I just expected that you check if BB is empty. However, your implementation doesn't seem to really check if BB is empty or not. So, I wanted to understand your intention.

I also doubt if BB really need to be a kind of empty? For me, it seems okay to duplicate return to BB even when BB is not empty as you implement right now.

I think this need to be done before eliminateMostlyEmptyBlocks(), but BB doesn't need to be empty.

Please let me know if there is any downside of doing this when BB is non-empty.

test/CodeGen/MIR/Generic/early-ret-bb.mir
1

Why don't you use opt :

opt < %s -codegenprepare

Ping.
I am just wondering if any other targets have had a chance to try this out and whether there's any interest in proceeding with this patch as a result.
I will certainly address the comments raised by @junbuml if this patch has a future, but I'd like to evaluate whether there's any interest in proceeding before doing more work on this patch. Perhaps I'll open this up to a wider audience with a quick RFC email on llvm-dev.

Hi Nemanja,

Sorry for the delay. I tried it on AArach64 for spec2006/2000, and I didn't see any clear difference in performance. However, I do not disagree with the idea.
Thanks,
Jun

Isn't tail duplication a better place to do this?

Isn't tail duplication a better place to do this?

Possibly, I'll have a look at that.

sfertile commandeered this revision.Oct 27 2017, 10:02 AM
sfertile added a reviewer: nemanjai.
sfertile edited the summary of this revision. (Show Details)Nov 1 2017, 12:10 PM
sfertile edited the summary of this revision. (Show Details)Nov 1 2017, 12:13 PM

Isn't tail duplication a better place to do this?

So I attempted this in the PreRA tail-duplication pass. Unfortunately by the time we get there the opportunity is gone. If we do not duplicate the return then CodegenPrepare will end up deleting the for.cond.cleanup.loopexit block and we loose the shrink wrapping opportunity.

So I attempted this in the PreRA tail-duplication pass. Unfortunately by the time we get there the opportunity is gone. If we do not duplicate the return then CodegenPrepare will end up deleting the for.cond.cleanup.loopexit block and we loose the shrink wrapping opportunity.

Then perhaps shrink-wrapping should be extended to be able to add blocks. CodeGenPrepare is not the place for this.

CodeGenPrepare is not the place for this.

Thanks for the feedback. Can you elaborate a bit more on why we wouldn't want to do this in CodeGenPrepare? Is the issue that we are always duplicating the return right now? If thats the problem then we can try to come-up with situations where it may be profitable to duplicate the return and only do so under those circumstances. Also shrink-wrapping isn't the only thing helped by this. This also addresses an issue where we miss tail-call opportunities because the result of the call feeds a PHI which feeds the return instruction. By duplicating the return into the predecessor and removing the PHI we are able to tail call. In this case we would have to do the transformation here since we decide whether to tail-call or not during the SDag lowering, which comes next and only has access to a single basic block at a time.

Then perhaps shrink-wrapping should be extended to be able to add blocks.

I don't know enough about shrink wrapping to know if that is feasible or not. It doesn't seem practical to me, but maybe Francis or Quentin can comment?

I don't know enough about shrink wrapping to know if that is feasible or not. It doesn't seem practical to me, but maybe Francis or Quentin can comment?

We'd actually want to do something relate to solve https://bugs.llvm.org/show_bug.cgi?id=33868.
We didn't have a chance to look at it yet.

What I'd like to have is a way to simulate critical edge splitting in DT and only materialize the basic block if we end up inserting something into it.

Don't know if splitting critical edges would solve what you are supporting here, though.

That being said, like Krzysztof said, I don't think CodeGenPrepare is the right place for this.

Thanks for the feedback. Can you elaborate a bit more on why we wouldn't want to do this in CodeGenPrepare?

Generally speaking, CGP is kind of a hack. We should work to take away from it, not to add to it. From what I know, the main motivation for CGP was to address some deficiencies of the instruction selection via the selection DAG, so we should at least try to limit it to doing that.

Opportunistically adding simple transformations to where it's convenient, not to where they belong logically is a suboptimal long-term strategy, although it looks appealing in the short term. Writing an optimization that can create shrink-wrapping opportunities on optimized code is certainly more involved, but will likely provide better results.

Thanks for the feedback. Can you elaborate a bit more on why we wouldn't want to do this in CodeGenPrepare?

Generally speaking, CGP is kind of a hack. We should work to take away from it, not to add to it. From what I know, the main motivation for CGP was to address some deficiencies of the instruction selection via the selection DAG, so we should at least try to limit it to doing that.

Thanks, I didn't realize that was CGP main motivation.

Thanks for the feedback. Can you elaborate a bit more on why we wouldn't want to do this in CodeGenPrepare?

Opportunistically adding simple transformations to where it's convenient, not to where they belong logically is a suboptimal long-term strategy, although it looks appealing in the short term. Writing an optimization that can create shrink-wrapping opportunities on optimized code is certainly more involved, but will likely provide better results.

Yes, I can appreciate that. Abandoning this patch to address it properly makes sense.

What is people opinion on duplicating the return for the very specific case of enabling a tail-call? I don't think that can be cleaned up inside the selection-dag, and after the selection-dag is too late. Is cgp the best place to address that or is there somewhere better to consider?

I don't know enough about shrink wrapping to know if that is feasible or not. It doesn't seem practical to me, but maybe Francis or Quentin can comment?

We'd actually want to do something relate to solve https://bugs.llvm.org/show_bug.cgi?id=33868.
We didn't have a chance to look at it yet.

What I'd like to have is a way to simulate critical edge splitting in DT and only materialize the basic block if we end up inserting something into it.

Don't know if splitting critical edges would solve what you are supporting here, though.

IIUC I think it would address our motivating examples, at least the for the shrink-wrap opportunities we were interested in . Can you describe what you had planned a bit more? I believe this is something either I or one of my team members would be willing to pick up.

sfertile abandoned this revision.Nov 4 2017, 10:19 AM

It seems dupRetToEnableTailCallOpts already does duplicate all the tail-call cases PPC is missing on any of the targets that implement 'mayBeEmittedAsTailCall', so I know to address that.