This is an archive of the discontinued LLVM Phabricator instance.

[flang] Lower TRANSPOSE without using runtime.
ClosedPublic

Authored by vzakhari on Jul 11 2022, 9:09 AM.

Details

Summary

Calling runtime TRANSPOSE requires a temporary array for the result,
and, sometimes, a temporary array for the argument. Lowering it inline
should provide faster code.

I added -opt-transpose control just for debugging purposes temporary.
I am going to make driver changes that will disable inline lowering
for -O0. For the time being I would like to enable it by default
to expose the code to more tests.

Diff Detail

Event Timeline

vzakhari created this revision.Jul 11 2022, 9:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2022, 9:09 AM
vzakhari requested review of this revision.Jul 11 2022, 9:09 AM
jeanPerier accepted this revision.Jul 12 2022, 12:20 AM

Looks great to me

This revision is now accepted and ready to land.Jul 12 2022, 12:20 AM

Thank you for the review, Jean!

This revision was automatically updated to reflect the committed changes.
Leporacanthicus added inline comments.
flang/lib/Lower/ConvertExpr.cpp
96

I definitely would prefer either a -O0 turns this off.

We probably should make the flang driver do this too.

5100

Did you consider making this a FIR-pass, similar to how I've done SUM-intrinsic call here:
https://reviews.llvm.org/D125407

That retains the original FIR code for a while longer, potentially allowing other optimisation passes to do their thing, before the decision to inline the transpose function.

flang/test/Lower/Intrinsics/transpose_opt.f90
15

All of the checks here seem to be auto-generated. Is it actually necessary to test ALL of them?

It makes it easier to see what is ACTUALLY the key parts of the test, if there's fewer CHECK lines - and it also means that if someone decides to change other aspects of the compiler, they (or someone else) won't have to change tests for that in places that don't really matter.

I suspect we have (other) tests that check that we can allocate a temporary array on the heap (or on the stack), for example.

Even more so for the second, longer test.

vzakhari added inline comments.Jul 12 2022, 3:45 PM
flang/lib/Lower/ConvertExpr.cpp
96

Agree. I will work on this.

5100

Thank you for the pointer! Yes, I did consider this. I found that temporary array might have already been created for the argument of TRANSPOSE, e.g. if it is an array expression, so just "inlining" TRANSPOSE late does not get rid of the temporary array overhead. Later optimizations would have to fuse the adjacent loops and optimize away the temporary. I was not able to achieve this with a manually written FIR and existing optimization passes, but I might have missed something.

I wonder what optimization opportunities you are thinking about that we might be missing with the proposed TRANSPOSE lowering - can you please share?

flang/test/Lower/Intrinsics/transpose_opt.f90
15

I used mlir/utils/generate-test-checks.py, and, yes, we do not need to check everything here. I will try to reduce the amount of checks. I guess a CHECK-NOT for the TRANSPOSE runtime call is also missing.

I second @Leporacanthicus here. Also, please note that (from the docs):

There is a strong expectation that authors respond promptly to post-commit feedback and address it. Failure to do so is cause for the patch to be reverted.

Why haven't Matt's suggestions been addressed yet? In particular, the tests added here contain a lot of noise and do not adhere to the official guidelines (from the docs ):

Tests should be minimal, and only check what is absolutely necessary.

Using generate-test-checks.py is fine (and encouraged), but in this particular case some post-processing is clearly required. Also, generating a test with a tool doesn't automatically make it adhere to FileCheck best practices.

Lastly, inlining in LLVM Flang has been discussed extensively with the whole community on Discourse. This patch effectively bypasses any interaction with the community. I've already hinted in https://reviews.llvm.org/D128385 that we would appreciate if you included the community more by posting proposals and inviting people to review on e.g. Discourse. I see at least a couple of questions that would be good to address through a discussion involving more people:

  • IIUC, Mats has been trying "to keep things at a high-level for as long as possible" (Steve's suggestion). This patch suggests that that design principle is not a priority. Is this correct?
  • Do you see this as a replacement or complementary to https://reviews.llvm.org/D128385? It sounds like we may end-up with 2 different approaches for inlining in the near future. Is that the end goal?

Lastly, how do you plan to integrate an llvm::cl option with the driver?

-Andrzej

kiranchandramohan added inline comments.
flang/lib/Lower/ConvertExpr.cpp
5100

I was not able to achieve this with a manually written FIR and existing optimization passes, but I might have missed something.

The presentation (https://slides.com/rajanwalia/deck) by @rjnw talked about Loop Fusion in Affine FIR. Did you try the Affine Promotion and Affine passes?

Also, please see the opinion express by the code-owner of this part of the code base that "it is a stronger investment to remove temporaries in the optimizer than special casing" in https://discourse.llvm.org/t/rfc-how-to-inline-fortran-inrinsics/61761/23. From this opinion what I inferred is that even if it does not remove now, this is the preferred path and we should invest to improve the optimizer to ultimately remove the temporaries.

I wonder what optimization opportunities you are thinking about that we might be missing with the proposed TRANSPOSE lowering - can you please share?

I can think of hoisting the call to transpose outside a loop, CSE etc.

I second @Leporacanthicus here. Also, please note that (from the docs):

There is a strong expectation that authors respond promptly to post-commit feedback and address it. Failure to do so is cause for the patch to be reverted.

Sorry for the delay. I was trying to address both Matt's concerns, i.e. disable the "optimized" lowering under -O0 and remove unnecessary checks from the LIT tests. I am still learning Flang infrastructure, so it took me a while to figure out how to properly control lowering from the driver. I am about ready to upload the changes for review, so please bear with me.

Why haven't Matt's suggestions been addressed yet? In particular, the tests added here contain a lot of noise and do not adhere to the official guidelines (from the docs ):

Tests should be minimal, and only check what is absolutely necessary.

Using generate-test-checks.py is fine (and encouraged), but in this particular case some post-processing is clearly required. Also, generating a test with a tool doesn't automatically make it adhere to FileCheck best practices.

I believe the test is minimal, in particular I wanted to test the new TRANSPOSE lowering itself, but also passing result of TRANSPOSE to a subroutine and a case of using TRANSPOSE in the context of assignment with allocatable LHS (because I have made mistakes for this particular use-case while implementing this). Can you please clarify what post-processing you are referring to?

Lastly, inlining in LLVM Flang has been discussed extensively with the whole community on Discourse. This patch effectively bypasses any interaction with the community. I've already hinted in https://reviews.llvm.org/D128385 that we would appreciate if you included the community more by posting proposals and inviting people to review on e.g. Discourse. I see at least a couple of questions that would be good to address through a discussion involving more people:

  • IIUC, Mats has been trying "to keep things at a high-level for as long as possible" (Steve's suggestion). This patch suggests that that design principle is not a priority. Is this correct?

I believe my implementation does not violate the design principle of keeping things at a high-level for as long as possible, given the not-so-high-level FIR that we produce around TRANSPOSE currently. Let's discuss it in discourse further.

  • Do you see this as a replacement or complementary to https://reviews.llvm.org/D128385? It sounds like we may end-up with 2 different approaches for inlining in the near future. Is that the end goal?

I replied in https://discourse.llvm.org/t/snap-performance-analysis-more-detailed-than-the-presentation/60636/21. I think there is currently no bulletproof way to have optimal handling for all kinds of intrinsics ending up in runtime calls. So I think we may try different approaches. I did get some positive performance results with this implementation, and it is made in a non-intrusive way so we can replace it in future if other approaches prove to be better.

Lastly, how do you plan to integrate an llvm::cl option with the driver?

I will upload the changes soon.

-Andrzej

vzakhari added inline comments.
flang/lib/Lower/ConvertExpr.cpp
5100

Yes, I did try them, and they did not work for SPEC CPU 178.galgel case, meaning that affine promotion failed to convert the FIR loops to affine loops. After discussing this with Eric and other teammembers I decided to go with another approach to see how "bad" the current TRANSPOSE handling is. This work is represented with this commit. I agree that later inlining (and improving related optimization passes) is worth investing into. At the same time, I tried to make this implementation in a non-itrusive way, it gives immediate performance improvement and it may serve as a reference performance goal for later inlining. Moreover, the approach taken here was proposed by @klausler in https://github.com/llvm/llvm-project/blob/main/flang/docs/ArrayComposition.md a while ago, and I confirmed with @schweitz that this path is worth investigating as well.

I think the loop hoisting and CSE of TRANSPOSE is a long shot due to the opaque nature of the runtime call (moreover, taking into account that it allocates memory and has to be optimized together with its freemem companion), but these are valid examples that we may also need to address - thanks! I wonder if MLIR/LLVM optimization passes are able to hoist invariant loops out of their parent loops, which may apply to the "optimized" TRANSPOSE lowering.

Thanks for the prompt reply!

I believe the test is minimal, in particular I wanted to test the new TRANSPOSE lowering itself, but also passing result of TRANSPOSE to a subroutine and a case of using TRANSPOSE in the context of assignment with allocatable LHS (because I have made mistakes for this particular use-case while implementing this).

IMO, a minimal test could look like this:

subroutine transpose_test(x, y)
   real :: x(2,3), y(3,2)
   y = TRANSPOSE(x)
end subroutine

Introducing subroutine calls or allocatables means additional complexities. These can be helpful when testing edge cases - is this what you intended? Either way, the most basic case is not tested :) Also, the key point of this patch is to get rid of calls to the runtime (i.e. fir.call @_FortranATranspose), but that's not tested (missing CHECK-NOT: _FortranATranspose). Also, check lines like this one can be removed:

! CHECK:         %[[VAL_1:.*]] = arith.constant 2 : index

(are the values of indices important here?) Things like this have been extracted from e.g. transpose.f90 (not sure why not re-use that file for the tests added here?) and that's basically what I had in mind. Also, LIT variables have been given some meaningful names, which makes parsing the test by humans much easier (which becomes extremely helpful when triaging failures). Things like VAL_67 are not particularity helpful. Also, due to the noise these tests make it rather hard to tell the difference between -opt-transpose=true and -opt-transpose=false.

Can you please clarify what post-processing you are referring to?

I basically meant manual editing to extract the key parts relevant to the test. Also, when introducing e.g.allocatable, I would focus on testing "new" things compared to the other test(s) or things that make dealing with allocatable different/tricky. Put differently, it's hard to review ~90 lines of check lines ;-) Btw, I see that some of my points are already being addressed in https://reviews.llvm.org/D130204, thanks!

So I think we may try different approaches.

Cool, makes sense!

I did get some positive performance results with this implementation

I saw your post on Discourse, thanks for sharing!