This is an archive of the discontinued LLVM Phabricator instance.

[flang] Turn on use-desc-for-alloc by default
ClosedPublic

Authored by jeanPerier on Apr 21 2023, 2:27 AM.

Details

Summary

Currently, local allocatables and contiguous/scalar pointers (and some
other conditions) are lowered to a set of independent variables in FIR
(one for the address, one for each bound and one for character length).
The intention was to help LLVM get rids of descriptors. But LLVM knows
how to do that anyway in those cases:

subroutine foo(x)
 real, target :: x(100)
 real, pointer, contiguous :: p(:)
 p => x
 call bar(p(50))
end subroutine

The output fir the option on or off is the same after llvm opt -O1,
there is no descriptor anymore, the indirection is removed.

define void @foo_(ptr %0) local_unnamed_addr {
  %2 = getelementptr [100 x float], ptr %0, i64 0, i64 49
  tail call void @bar_(ptr %2)
  ret void
}

So the benefit of not using a descriptor in lowering is questionable,
and although it is abstracted as much as possible in the so called
MutableBoxValue class that represent allocatable/pointer in lowering
it is still causing bugs from time to time, and will also be a bit
problematic when emitting debug info for the pointer/allocatable.

In HLFIR lowering, the simplification to always use a descriptor in
lowering was already made. This patch allows decorrelating the impact
from this change from the bigger impact HLFIR will have so that it
is easier to get feedback if this causes performance issues.

The lowering tests relying on the previous behavior are only updated
to set back this option to true. The reason is that I think we should
preserve coverage of the code dealing with the "non descriptor" approach
in lowering until we actually get rid of it. The other reason is that
the test will have to be or are already covered by equivalent HLFIR
tests, which use descriptors.

[edit: fixed the example to pass p instead of x to bar, otherwise the test is pointless. The resulting LLVM is the same after opt -O1].

Diff Detail

Event Timeline

jeanPerier created this revision.Apr 21 2023, 2:27 AM
jeanPerier requested review of this revision.Apr 21 2023, 2:27 AM
tblah added a comment.Apr 21 2023, 2:50 AM

Could there be cases where LLVM optimization passes take longer to run because they are removing these descriptors we know we don't need? I would guess this is okay for real code but I was wondering if you have tried it?

Could there be cases where LLVM optimization passes take longer to run because they are removing these descriptors we know we don't need? I would guess this is okay for real code but I was wondering if you have tried it?

Good question. The problem is the "we know we don't need". At that point of the lowering code, we do not really know that we do not need a descriptor, in fact, we still create one with this option turned off, because as soon as the pointer/allocatable is passed as a pointer/allocatable to the runtime or user code, we need a descriptor an we insert sync operation from/to these values after/before the call. This early optimization is only a win if the pointer/allocatable never escapes.

Here is some dumb numbers to illustrate:

program_1.f90

subroutine foo(x)
 real, target :: x(100)
 real, pointer, contiguous :: p(:)
 p => x
 call bar(p(50))
 ! ...
 ! ... repeat call bar(p(50)) 10000 times.
 ! ...
end subroutine

--use-desc-for-alloc=false:

  • Number of .ll lines after codgeneration from FIR to LLVM: 50k.
  • time flang-new -O1 (on a beefy X86_64 machine): 2s.

--use-desc-for-alloc=true:

  • Number of .ll lines after codgeneration from FIR to LLVM: 180k.
  • time flang-new -O1 (on a beefy X86_64 machine): 5s.

-> Same code after -O1, but x2 faster compilation with the previous --use-desc-for-alloc=false option.

program_2.f90:

subroutine foo(x)
 real, target :: x(100)
 real, pointer, contiguous :: p(:)
 interface
  subroutine bar_ptr(p)
    real, pointer, contiguous :: p(:)
  end subroutine
 end interface

 p => x
 call bar(p(50))
 ! ...
 ! ... repeat call bar_ptr(p(50)) 10000 times.
 ! ...
end subroutine

--use-desc-for-alloc=false:

  • Number of .ll lines after codgeneration from FIR to LLVM: 300k.
  • time flang-new -O1 (on a beefy X86_64 machine): 25s. And did not managed to get rid of all the intermediate descriptor sync, even at -O3!

--use-desc-for-alloc=true:

  • Number of .ll lines after codgeneration from FIR to LLVM: 10k.
  • time flang-new -O1 (on a beefy X86_64 machine): 0.8s.

-> x30 (!) faster compilation with the new --use-desc-for-alloc=true option, and better code with it (the sync from/to descriptor are not optimized away with -O3 with the --use-desc-for-alloc=flase). The reason for the sync optimization failure is that LLVM cannot guess that it can get rid of the code that sets-up the CFI code and versions in the descriptor when doing the sync from the values to the fir.box (it assumes the previous call may have modified those descriptor fields, and that it should keep these field updates generate by the fir.embox before the next calls).

Conclusion: the answer to your question depends of the actual workloads... I was able to write you stupid code that compiles 30x faster with this patch, but it is probably not representative (This only shows if passing the pointer/allocatabale to a procedure taking a pointer/allocatable a lot). The take-away is that the inserted "sync" operations of the current lowering are expensive and hard to optimize away, while it is easy to get rid of descriptors. So this was probably a premature optimization anyway. Overall, on "actual" workloads, I do not really expect we are able to measure much difference after this patch. But if you do see perf regressions, we can revert that and try to add a FIR pass to "scalarize" fir.box before we turn them into big LLVM structs and many insert/extract operations. This pass would be able to easily see the cost of scalarization, which we are not at that point.

jeanPerier edited the summary of this revision. (Show Details)Apr 21 2023, 5:10 AM
tblah accepted this revision.Apr 21 2023, 5:26 AM

Thanks for the detailed explanation.

This looks good to me. As you say in the description, merging this is the easiest way to find out early if doing this has unforeseen consequences.

Please wait for review from somebody else too, because I haven't worked much with FIR lowering.

This revision is now accepted and ready to land.Apr 21 2023, 5:26 AM
This revision was landed with ongoing or failed builds.Apr 24 2023, 12:08 AM
This revision was automatically updated to reflect the committed changes.