This is an archive of the discontinued LLVM Phabricator instance.

[InlineFunction] Disable emission of alignment assumptions by default
ClosedPublic

Authored by nikic on Mar 26 2020, 1:31 PM.

Details

Summary

In D74183, clang started emitting alignment for sret parameters unconditionally. This caused a 1.5% compile-time regression on tramp3d-v4. The reason is that we now generate many instances of IR like

%ptrint = ptrtoint %class.GuardLayers* %guards_m to i64  ; reinterpret the pointer as an integer
%maskedptr = and i64 %ptrint, 3                           ; keep the low two bits
%maskcond = icmp eq i64 %maskedptr, 0                     ; true iff the pointer is 4-byte aligned
tail call void @llvm.assume(i1 %maskcond)                 ; record that fact for later passes

to preserve the alignment information during inlining. Based on the size increase of the final binary, it is likely that these assumptions not only increase compile-time, but also regress optimizations (due to the usual issues with assumes).

We already encountered the same problem in Rust, where we (unlike Clang) generally prefer to emit alignment information absolutely everywhere it is available. We were only able to do this after hardcoding -preserve-alignment-assumptions-during-inlining=false, because we were seeing significant optimization and compile-time regressions otherwise.

This patch disables -preserve-alignment-assumptions-during-inlining by default, because we should not be punishing people for adding more alignment annotations.

I think once the operand bundle work by @Tyker and @jdoerfert shakes out, it might be possible to use an operand bundle based assume for this instead and avoid some/most of the overhead.
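To illustrate what that could look like, here is a sketch using the "align" assume operand bundle (the exact bundle form was still under review at the time, so treat this as illustrative rather than final syntax):

call void @llvm.assume(i1 true) [ "align"(%class.GuardLayers* %guards_m, i64 4) ]

A single intrinsic call carries the pointer and the alignment directly, so the ptrtoint/and/icmp scaffolding above would no longer be needed.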

Diff Detail

Event Timeline

nikic created this revision. · Mar 26 2020, 1:31 PM
Herald added a project: Restricted Project. · Mar 26 2020, 1:31 PM

Is there no phase-ordering-esque test for that?

I'm fine with disabling this for now. The patch that makes IRBuilder emit the alignment assumption as an operand bundle already exists but is not merged yet (it is guarded by an option).
We should measure that solution and then either enable this again by default or determine what else is needed.
FWIW, I expect us to need a pass that merges and eliminates llvm.assume with operand bundles soon, that should drive compile time down again for cases where we emit equivalent or subsuming assumptions.

nikic updated this revision to Diff 253104. · Mar 27 2020, 6:56 AM

Add phase ordering test.

nikic added a comment. · Mar 27 2020, 7:01 AM

> Is there no phase-ordering-esque test for that?

After looking at IR diffs a bit more, one difference I noticed is that jump threading is not being performed in some places. In this case the reason is not multi-use, but cost heuristics based on instruction counts. The alignment assumption adds 4 extra instructions, and if the block duplication threshold is at 6 instructions, that can easily make a difference. The phase ordering test is loosely based around that idea. Does that look useful to you?
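Purely to make that concrete, here is a hand-written fragment (not taken from tramp3d-v4; JumpThreading's cost accounting has more nuance, but the effect is the same): the four assumption instructions sit next to the block's real work and push an otherwise small block toward, or over, the duplication threshold.

%ptrint = ptrtoint i32* %p to i64       ; \
%maskedptr = and i64 %ptrint, 3         ;  four instructions that exist only to
%maskcond = icmp eq i64 %maskedptr, 0   ;  carry the alignment assumption
call void @llvm.assume(i1 %maskcond)    ; /
%v = load i32, i32* %p, align 4         ; \ the block's actual work
%c = icmp eq i32 %v, 0                  ; /
br i1 %c, label %then, label %else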

@jdoerfert @Tyker In case it isn't on the roadmap yet... I guess we need to switch all the code that is currently skipping debug intrinsics to skip debug intrinsics and assumes, to make sure they don't affect codegen. With operand bundles, that should remove any impact assumes have on instruction count heuristics. Without operand bundles, it will at least lessen it by not counting the assume itself.

>> Is there no phase-ordering-esque test for that?

> After looking at IR diffs a bit more, one difference I noticed is that jump threading is not being performed in some places. In this case the reason is not multi-use, but cost heuristics based on instruction counts. The alignment assumption adds 4 extra instructions, and if the block duplication threshold is at 6 instructions, that can easily make a difference. The phase ordering test is loosely based around that idea. Does that look useful to you?

Yes, thank you.

> @jdoerfert @Tyker In case it isn't on the roadmap yet... I guess we need to switch all the code that is currently skipping debug intrinsics to skip debug intrinsics and assumes, to make sure they don't affect codegen. With operand bundles, that should remove any impact assumes have on instruction count heuristics. Without operand bundles, it will at least lessen it by not counting the assume itself.

lebedev.ri added inline comments. · Mar 27 2020, 7:15 AM
test/Transforms/PhaseOrdering/inlining-alignment-assumptions.ll:4

Err, one thing I forgot to add: I mainly wanted to see the test that exercises
the *default* value of preserve-alignment-assumptions-during-inlining.
So I think this needs one more line, like:

; RUN: opt -S -O2 -preserve-alignment-assumptions-during-inlining=0 < %s | FileCheck %s --check-prefixes=CHECK,ASSUMPTIONS-OFF,FALLBACK-0
; RUN: opt -S -O2 -preserve-alignment-assumptions-during-inlining=1 < %s | FileCheck %s --check-prefixes=CHECK,ASSUMPTIONS-ON,FALLBACK-1
; RUN: opt -S -O2 < %s | FileCheck %s --check-prefixes=CHECK,ASSUMPTIONS-OFF,FALLBACK-DEFAULT
nikic updated this revision to Diff 253130. · Mar 27 2020, 8:37 AM
nikic marked an inline comment as done.

Add a RUN line for the default behavior as well.

I don't object to doing this, but...

> After looking at IR diffs a bit more, one difference I noticed is that jump threading is not being performed in some places. In this case the reason is not multi-use, but cost heuristics based on instruction counts. The alignment assumption adds 4 extra instructions, and if the block duplication threshold is at 6 instructions, that can easily make a difference.

This is an independent problem that we might fix first. Jump threading should be collecting and ignoring ephemeral values for its heuristic. If it's not doing that, please fix that first and then reevaluate.

> I don't object to doing this, but...

>> After looking at IR diffs a bit more, one difference I noticed is that jump threading is not being performed in some places. In this case the reason is not multi-use, but cost heuristics based on instruction counts. The alignment assumption adds 4 extra instructions, and if the block duplication threshold is at 6 instructions, that can easily make a difference.

> This is an independent problem that we might fix first. Jump threading should be collecting and ignoring ephemeral values for its heuristic. If it's not doing that, please fix that first and then reevaluate.

Given the ongoing work to convert assumes to operand bundles, which would reduce ignoring of ephemeral values to ignoring the assume itself, I don't think it makes sense to do such a change at this time. Ignoring ephemeral values is not cheap, and we have numerous places that use limited length instruction scans as a cheap heuristic. I don't want to start computing ephemeral values every time InstCombine does a backwards scan from a load or store.

I'm only using JumpThreading as an illustrative example for a test case here. As you well know, JumpThreading is neither the only nor the most important issue when it comes to handling of assumes. As things stand right now, fixing any particular place is not going to change anything about the big picture. And of course, it will do nothing about compile-time impact.

That said, we're already forced to disable this in Rust anyway, so I'm not particularly invested in making this change -- it makes no difference to us. I'm happy to abandon this revision if it does not seem sufficiently beneficial on its own.

I'm in favor of adding the test (and changing the default value). We have a clear path to re-enable it and it is not far away either.

arsenm added a subscriber: arsenm. · Apr 28 2020, 10:44 AM

The ptrtoint inhibits SROA. Don't we have a less cumbersome way to test for alignment?

> Don't we have a less cumbersome way to test for alignment?

That is the current canonical alignment assumption, at least until the attribute bundles are here.

> The ptrtoint inhibits SROA.

That issue seems not entirely new to these changes.
Is there a bug report with a reproducer?

>> Don't we have a less cumbersome way to test for alignment?

> That is the current canonical alignment assumption, at least until the attribute bundles are here.

>> The ptrtoint inhibits SROA.

> That issue seems not entirely new to these changes.
> Is there a bug report with a reproducer?

Yes:

; RUN: opt -S -O3 < %s | FileCheck %s

; CHECK: @caller
; CHECK-NOT: alloca
; CHECK-NEXT: ret void

target datalayout = "e-p:64:64-p5:32:32-A5"

define amdgpu_kernel void @caller() {
  %alloca = alloca i64, align 8, addrspace(5)
  %cast = addrspacecast i64 addrspace(5)* %alloca to i64*
  call void @callee(i64* sret align 8 %cast)
  ret void
}

define internal void @callee(i64* noalias sret align 8 %arg) {
  store i64 0, i64* %arg, align 8
  ret void
}
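
For context, this is roughly what @caller looks like after inlining when the alignment assumption is emitted (a hand-written sketch, not actual -O3 output). The ptrtoint is a use of the pointer that SROA will not look through, so the alloca survives:

define amdgpu_kernel void @caller() {
  %alloca = alloca i64, align 8, addrspace(5)
  %cast = addrspacecast i64 addrspace(5)* %alloca to i64*
  %ptrint = ptrtoint i64* %cast to i64     ; this use is what inhibits SROA
  %maskedptr = and i64 %ptrint, 7          ; align 8, hence mask 7
  %maskcond = icmp eq i64 %maskedptr, 0
  call void @llvm.assume(i1 %maskcond)
  store i64 0, i64* %cast, align 8
  ret void
}
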
cfang added a subscriber: cfang. · Apr 28 2020, 1:40 PM
arsenm accepted this revision. · Apr 29 2020, 3:24 PM

LGTM. This should stay committed at least until the SROA issue is fixed.

This revision is now accepted and ready to land. · Apr 29 2020, 3:24 PM
nikic added a comment. · Apr 30 2020, 1:40 PM

@rampitec Thanks. I guess that means the situation here is somewhat worse for targets that use address spaces. As we cannot assume that an addrspace cast preserves alignment (as far as I know), we will end up inserting the alignment assumptions in many more cases. On targets that do not use address spaces, we would not insert these assumptions for allocas, and thus sidestep the issue.
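
As a quick sketch of that last point (a hypothetical example, not from the patch): on a target without address spaces the sret argument is the alloca itself, its align 8 is already provable, and the inliner has nothing to assume, so the problem does not arise.

define void @caller_flat() {
  %alloca = alloca i64, align 8
  call void @callee(i64* sret align 8 %alloca)  ; alignment already known, no assumption emitted
  ret void
}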

I've submitted https://bugs.llvm.org/show_bug.cgi?id=45763 to make sure the SROA issue is tracked somewhere.

This revision was automatically updated to reflect the committed changes.