This is an archive of the discontinued LLVM Phabricator instance.


Differential D129497

[flang] Lower TRANSPOSE without using runtime.
Closed · Public

Authored by vzakhari on Jul 11 2022, 9:09 AM.


Details

Reviewers

jeanPerier

Commits

rGa280043b5231: [flang] Lower TRANSPOSE without using runtime.

Summary

Calling runtime TRANSPOSE requires a temporary array for the result,
and, sometimes, a temporary array for the argument. Lowering it inline
should provide faster code.

I added the -opt-transpose control temporarily, just for debugging purposes.
I am going to make driver changes that will disable inline lowering
for -O0. For the time being I would like to enable it by default
to expose the code to more tests.
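
To illustrate the idea in the summary, here is a minimal sketch in plain C++ (not the Flang lowering API). It models an array expression as an element "continuation" and expresses TRANSPOSE as a wrapper that swaps the two indices before invoking the continuation of its argument, so no intermediate transposed array is built. The names `Elem2D`, `transposeOf`, and `materialize` are hypothetical and exist only for this example.

```cpp
#include <array>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for an element continuation: given the indices
// (i, j) of the iteration space, produce the element value.
using Elem2D = std::function<float(std::size_t, std::size_t)>;

// TRANSPOSE as a continuation: element (i, j) of the result is
// element (j, i) of the argument. No temporary array is created here.
Elem2D transposeOf(Elem2D src) {
  return [src](std::size_t i, std::size_t j) { return src(j, i); };
}

// Evaluate a continuation over a Rows x Cols iteration space and store
// the elements into the destination array.
template <std::size_t Rows, std::size_t Cols>
std::array<std::array<float, Cols>, Rows> materialize(const Elem2D &elem) {
  std::array<std::array<float, Cols>, Rows> out{};
  for (std::size_t i = 0; i < Rows; ++i)
    for (std::size_t j = 0; j < Cols; ++j)
      out[i][j] = elem(i, j);
  return out;
}
```

For a 2x3 source `x` wrapped as `src`, calling `materialize<3, 2>(transposeOf(src))` fills a 3x2 destination directly from `x`, which mirrors an assignment such as `y = TRANSPOSE(x)` evaluated in a single loop nest.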

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vzakhari created this revision. Jul 11 2022, 9:09 AM

Herald added a project: Restricted Project. Jul 11 2022, 9:09 AM

Herald added subscribers: mehdi_amini, jdoerfert.

vzakhari requested review of this revision. Jul 11 2022, 9:09 AM

Harbormaster completed remote builds in B174692: Diff 443669. Jul 11 2022, 10:00 AM

Looks great to me

This revision is now accepted and ready to land. Jul 12 2022, 12:20 AM

Thank you for the review, Jean!

Closed by commit rGa280043b5231: [flang] Lower TRANSPOSE without using runtime. (authored by vzakhari). Jul 12 2022, 8:43 AM

This revision was automatically updated to reflect the committed changes.

vzakhari added a commit: rGa280043b5231: [flang] Lower TRANSPOSE without using runtime.

Leporacanthicus added a subscriber: Leporacanthicus. Jul 12 2022, 11:42 AM

Leporacanthicus added inline comments.

flang/lib/Lower/ConvertExpr.cpp
96	I would definitely prefer that `-O0` turns this off. We should probably make the flang driver do this too.
5100	Did you consider making this a FIR pass, similar to how I've handled the SUM intrinsic call here: https://reviews.llvm.org/D125407? That retains the original FIR code for a while longer, potentially allowing other optimisation passes to do their thing before the decision to inline the transpose function is made.
flang/test/Lower/Intrinsics/transpose_opt.f90
15	All of the checks here seem to be auto-generated. Is it actually necessary to test ALL of them? It is easier to see what ACTUALLY are the key parts of the test if there are fewer CHECK lines. It also means that if someone decides to change other aspects of the compiler, they (or someone else) won't have to change tests in places that don't really matter. I suspect we have (other) tests that check that we can allocate a temporary array on the heap (or on the stack), for example. This applies even more to the second, longer test.

vzakhari added inline comments. Jul 12 2022, 3:45 PM

flang/lib/Lower/ConvertExpr.cpp
96	Agree. I will work on this.
5100	Thank you for the pointer! Yes, I did consider this. I found that a temporary array might have already been created for the argument of `TRANSPOSE`, e.g. if it is an array expression, so just "inlining" `TRANSPOSE` late does not get rid of the temporary array overhead. Later optimizations would have to fuse the adjacent loops and optimize away the temporary. I was not able to achieve this with manually written `FIR` and existing optimization passes, but I might have missed something. I wonder what optimization opportunities you are thinking about that we might be missing with the proposed `TRANSPOSE` lowering. Can you please share?
flang/test/Lower/Intrinsics/transpose_opt.f90
15	I used `mlir/utils/generate-test-checks.py`, and, yes, we do not need to check everything here. I will try to reduce the amount of checks. I guess a `CHECK-NOT` for the `TRANSPOSE` runtime call is also missing.
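
The concern about the argument temporary can be sketched in plain C++ as an analogy (this is not FIR, and the `Matrix` type and both function names are hypothetical). For an expression like `TRANSPOSE(a + b)`, the two-step form first materializes `a + b` into a temporary and then transposes it in a second loop nest. Removing the temporary requires fusing the two loop nests, which is the form that lowering the transpose as an index swap produces directly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix used only for this illustration.
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Two-step form: the argument expression a + b is evaluated into a
// temporary first, and the transpose is a separate loop nest over it.
Matrix transposeSumWithTemp(const Matrix &a, const Matrix &b) {
  Matrix tmp(a.rows, a.cols);
  for (std::size_t i = 0; i < a.rows; ++i)
    for (std::size_t j = 0; j < a.cols; ++j)
      tmp.at(i, j) = a.at(i, j) + b.at(i, j);

  Matrix result(a.cols, a.rows);
  for (std::size_t i = 0; i < result.rows; ++i)
    for (std::size_t j = 0; j < result.cols; ++j)
      result.at(i, j) = tmp.at(j, i);
  return result;
}

// Fused form: the transpose only swaps the indices used to read the
// argument expression, so the temporary disappears.
Matrix transposeSumFused(const Matrix &a, const Matrix &b) {
  Matrix result(a.cols, a.rows);
  for (std::size_t i = 0; i < result.rows; ++i)
    for (std::size_t j = 0; j < result.cols; ++j)
      result.at(i, j) = a.at(j, i) + b.at(j, i);
  return result;
}
```

Both functions compute the same result; the difference is only in the extra allocation and the extra pass over memory in the two-step form.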

I second @Leporacanthicus here. Also, please note that (from the docs):

> There is a strong expectation that authors respond promptly to post-commit feedback and address it. Failure to do so is cause for the patch to be reverted.

Why haven't Matt's suggestions been addressed yet? In particular, the tests added here contain a lot of noise and do not adhere to the official guidelines (from the docs):

> Tests should be minimal, and only check what is absolutely necessary.

Using generate-test-checks.py is fine (and encouraged), but in this particular case some post-processing is clearly required. Also, generating a test with a tool doesn't automatically make it adhere to FileCheck best practices.

Lastly, inlining in LLVM Flang has been discussed extensively with the whole community on Discourse. This patch effectively bypasses any interaction with the community. I've already hinted in https://reviews.llvm.org/D128385 that we would appreciate if you included the community more by posting proposals and inviting people to review on e.g. Discourse. I see at least a couple of questions that would be good to address through a discussion involving more people:

IIUC, Mats has been trying "to keep things at a high-level for as long as possible" (Steve's suggestion). This patch suggests that that design principle is not a priority. Is this correct?
Do you see this as a replacement or complementary to https://reviews.llvm.org/D128385? It sounds like we may end-up with 2 different approaches for inlining in the near future. Is that the end goal?

Lastly, how do you plan to integrate an llvm::cl option with the driver?

-Andrzej

kiranchandramohan added subscribers: rjnw, kiranchandramohan. Jul 20 2022, 3:48 AM

kiranchandramohan added inline comments.

flang/lib/Lower/ConvertExpr.cpp
5100	> I was not able to achieve this with a manually written FIR and existing optimization passes, but I might have missed something.

The presentation (https://slides.com/rajanwalia/deck) by @rjnw talked about Loop Fusion in Affine FIR. Did you try the Affine Promotion and Affine passes? Also, please see the opinion expressed by the code owner of this part of the code base that "it is a stronger investment to remove temporaries in the optimizer than special casing" in https://discourse.llvm.org/t/rfc-how-to-inline-fortran-inrinsics/61761/23. From this opinion, what I inferred is that even if the optimizer does not remove the temporaries now, this is the preferred path and we should invest in improving the optimizer to ultimately remove them.

> I wonder what optimization opportunities you are thinking about that we might be missing with the proposed TRANSPOSE lowering - can you please share?

I can think of hoisting the call to transpose outside a loop, CSE, etc.

In D129497#3665202, @awarzynski wrote:

> I second @Leporacanthicus here. Also, please note that (from the docs):
>
> > There is a strong expectation that authors respond promptly to post-commit feedback and address it. Failure to do so is cause for the patch to be reverted.

Sorry for the delay. I was trying to address both Matt's concerns, i.e. disable the "optimized" lowering under -O0 and remove unnecessary checks from the LIT tests. I am still learning Flang infrastructure, so it took me a while to figure out how to properly control lowering from the driver. I am about ready to upload the changes for review, so please bear with me.

> Why haven't Matt's suggestions been addressed yet? In particular, the tests added here contain a lot of noise and do not adhere to the official guidelines (from the docs):
>
> > Tests should be minimal, and only check what is absolutely necessary.
>
> Using generate-test-checks.py is fine (and encouraged), but in this particular case some post-processing is clearly required. Also, generating a test with a tool doesn't automatically make it adhere to FileCheck best practices.

I believe the test is minimal, in particular I wanted to test the new TRANSPOSE lowering itself, but also passing result of TRANSPOSE to a subroutine and a case of using TRANSPOSE in the context of assignment with allocatable LHS (because I have made mistakes for this particular use-case while implementing this). Can you please clarify what post-processing you are referring to?

> Lastly, inlining in LLVM Flang has been discussed extensively with the whole community on Discourse. This patch effectively bypasses any interaction with the community. I've already hinted in https://reviews.llvm.org/D128385 that we would appreciate if you included the community more by posting proposals and inviting people to review on e.g. Discourse. I see at least a couple of questions that would be good to address through a discussion involving more people:
>
> IIUC, Mats has been trying "to keep things at a high-level for as long as possible" (Steve's suggestion). This patch suggests that that design principle is not a priority. Is this correct?

I believe my implementation does not violate the design principle of keeping things at a high-level for as long as possible, given the not-so-high-level FIR that we produce around TRANSPOSE currently. Let's discuss it in discourse further.

> Do you see this as a replacement or complementary to https://reviews.llvm.org/D128385? It sounds like we may end-up with 2 different approaches for inlining in the near future. Is that the end goal?

I replied in https://discourse.llvm.org/t/snap-performance-analysis-more-detailed-than-the-presentation/60636/21. I think there is currently no bulletproof way to have optimal handling for all kinds of intrinsics ending up in runtime calls. So I think we may try different approaches. I did get some positive performance results with this implementation, and it is made in a non-intrusive way so we can replace it in future if other approaches prove to be better.

> Lastly, how do you plan to integrate an llvm::cl option with the driver?

I will upload the changes soon.

> -Andrzej

vzakhari added subscribers: klausler, schweitz. Jul 20 2022, 1:42 PM

vzakhari added inline comments.

flang/lib/Lower/ConvertExpr.cpp
5100	Yes, I did try them, and they did not work for the SPEC CPU 178.galgel case, meaning that affine promotion failed to convert the FIR loops to affine loops. After discussing this with Eric and other team members, I decided to go with another approach to see how "bad" the current `TRANSPOSE` handling is. This work is represented by this commit. I agree that later inlining (and improving related optimization passes) is worth investing in. At the same time, I tried to make this implementation non-intrusive; it gives an immediate performance improvement, and it may serve as a reference performance goal for later inlining. Moreover, the approach taken here was proposed by @klausler in https://github.com/llvm/llvm-project/blob/main/flang/docs/ArrayComposition.md a while ago, and I confirmed with @schweitz that this path is worth investigating as well. I think the loop hoisting and CSE of `TRANSPOSE` is a long shot due to the opaque nature of the runtime call (moreover, taking into account that it allocates memory and has to be optimized together with its `freemem` companion), but these are valid examples that we may also need to address - thanks! I wonder if MLIR/LLVM optimization passes are able to hoist invariant loops out of their parent loops, which may apply to the "optimized" `TRANSPOSE` lowering.

vzakhari mentioned this in D130204: [flang] Propagate lowering options from driver. Jul 20 2022, 2:24 PM

Thanks for the prompt reply!

In D129497#3666603, @vzakhari wrote:

> I believe the test is minimal, in particular I wanted to test the new TRANSPOSE lowering itself, but also passing result of TRANSPOSE to a subroutine and a case of using TRANSPOSE in the context of assignment with allocatable LHS (because I have made mistakes for this particular use-case while implementing this).

IMO, a minimal test could look like this:

subroutine transpose_test(x, y)
   real :: x(2,3), y(3,2)
   y = TRANSPOSE(x)
end subroutine

Introducing subroutine calls or allocatables means additional complexities. These can be helpful when testing edge cases - is this what you intended? Either way, the most basic case is not tested :) Also, the key point of this patch is to get rid of calls to the runtime (i.e. fir.call @_FortranATranspose), but that's not tested (missing CHECK-NOT: _FortranATranspose). Also, check lines like this one can be removed:

! CHECK:         %[[VAL_1:.*]] = arith.constant 2 : index

(are the values of indices important here?) Things like this have been extracted from e.g. transpose.f90 (not sure why not re-use that file for the tests added here?) and that's basically what I had in mind. Also, LIT variables have been given some meaningful names, which makes parsing the test by humans much easier (which becomes extremely helpful when triaging failures). Things like VAL_67 are not particularly helpful. Also, due to the noise, these tests make it rather hard to tell the difference between -opt-transpose=true and -opt-transpose=false.

> Can you please clarify what post-processing you are referring to?

I basically meant manual editing to extract the key parts relevant to the test. Also, when introducing e.g. allocatable, I would focus on testing "new" things compared to the other test(s) or things that make dealing with allocatable different/tricky. Put differently, it's hard to review ~90 lines of check lines ;-) Btw, I see that some of my points are already being addressed in https://reviews.llvm.org/D130204, thanks!

> So I think we may try different approaches.

Cool, makes sense!

> I did get some positive performance results with this implementation

I saw your post on Discourse, thanks for sharing!

vzakhari mentioned this in D130300: [flang] Reduced CHECKs for transpose_opt.f90. Jul 21 2022, 1:19 PM

vzakhari mentioned this in rGfa3c77043800: [flang] Reduced CHECKs for transpose_opt.f90. Jul 22 2022, 8:52 AM

vzakhari mentioned this in rGf1eb945f9a50: [flang] Propagate lowering options from driver. Aug 5 2022, 11:32 AM

Revision Contents

Path	Size
flang/lib/Lower/ConvertExpr.cpp	92 lines
flang/test/Lower/Intrinsics/transpose.f90	2 lines
flang/test/Lower/Intrinsics/transpose_opt.f90	134 lines

Diff 443963

flang/lib/Lower/ConvertExpr.cpp

@@ ... @@
 // construction. Some user codes may have very large array constructors for
 // which the default can be increased.
 static llvm::cl::opt<unsigned> clInitialBufferSize(
     "array-constructor-initial-buffer-size",
     llvm::cl::desc(
         "set the incremental array construction buffer size (default=32)"),
     llvm::cl::init(32u));

+// Lower TRANSPOSE as an "elemental" function that swaps the array
+// expression's iteration space, so that no runtime call is needed.
+// This lowering may help get rid of unnecessary creation of temporary
+// arrays. Note that the runtime TRANSPOSE implementation may be different
+// from the "inline" FIR, e.g. it may diagnose out-of-memory conditions
+// during the temporary allocation whereas the inline implementation
+// relies on AllocMemOp that will silently return null in case
+// there is not enough memory. So it may be a good idea to set
+// this option to false for -O0.
+static llvm::cl::opt<bool> optimizeTranspose(
+    "opt-transpose",
+    llvm::cl::desc("lower transpose without using a runtime call"),
+    llvm::cl::init(true));
+
 /// The various semantics of a program constituent (or a part thereof) as it may
 /// appear in an expression.
 ///
 /// Given the following Fortran declarations.
 /// ```fortran
 /// REAL :: v1, v2, v3
 /// REAL, POINTER :: vp1
 /// REAL :: a1(c), a2(c)
@@ ... @@ isIntrinsicModuleProcRef(const Fortran::evaluate::ProcedureRef &procRef) {
   if (!symbol)
     return false;
   const Fortran::semantics::Symbol *module =
       symbol->GetUltimate().owner().GetSymbol();
   return module && module->attrs().test(Fortran::semantics::Attr::INTRINSIC) &&
          module->name().ToString().find("omp_lib") == std::string::npos;
 }

+// A set of visitors to detect if the given expression
+// is a TRANSPOSE call that should be lowered without using
+// runtime TRANSPOSE implementation.
+template <typename T>
+static bool isOptimizableTranspose(const T &) {
+  return false;
+}
+
+static bool
+isOptimizableTranspose(const Fortran::evaluate::ProcedureRef &procRef) {
+  const Fortran::evaluate::SpecificIntrinsic *intrin =
+      procRef.proc().GetSpecificIntrinsic();
+  return optimizeTranspose && intrin && intrin->name == "transpose";
+}
+
+template <typename T>
+static bool
+isOptimizableTranspose(const Fortran::evaluate::FunctionRef<T> &funcRef) {
+  return isOptimizableTranspose(
+      static_cast<const Fortran::evaluate::ProcedureRef &>(funcRef));
+}
+
+template <typename T>
+static bool isOptimizableTranspose(Fortran::evaluate::Expr<T> expr) {
+  // If optimizeTranspose is not enabled, return false right away.
+  if (!optimizeTranspose)
+    return false;
+
+  return std::visit([&](const auto &e) { return isOptimizableTranspose(e); },
+                    expr.u);
+}
+
 namespace {

 /// Lowering of Fortran::evaluate::Expr<T> expressions
 class ScalarExprLowering {
 public:
   using ExtValue = fir::ExtendedValue;

   explicit ScalarExprLowering(mlir::Location loc,
@@ ... @@

   template <typename A>
   ExtValue gen(const Fortran::evaluate::Expr<A> &x) {
     // Whole array symbols or components, and results of transformational
     // functions already have a storage and the scalar expression lowering path
     // is used to not create a new temporary storage.
     if (isScalar(x) ||
         Fortran::evaluate::UnwrapWholeSymbolOrComponentDataRef(x) ||
-        isTransformationalRef(x))
+        (isTransformationalRef(x) && !isOptimizableTranspose(x)))
       return std::visit([&](const auto &e) { return genref(e); }, x.u);
     if (useBoxArg)
       return asArrayArg(x);
     return asArray(x);
   }
   template <typename A>
   ExtValue genval(const Fortran::evaluate::Expr<A> &x) {
     if (isScalar(x) || Fortran::evaluate::UnwrapWholeSymbolDataRef(x) ||
@@ ... @@ return
             [&](const auto &) { return fir::getBase(exv); });
         caller.placeInput(argIface, arg);
       }
       return ScalarExprLowering{loc, converter, symMap, getElementCtx()}
           .genCallOpAndResult(caller, callSiteType, retTy);
     };
   }

+  /// Lower TRANSPOSE call without using runtime TRANSPOSE.
+  /// Return continuation for generating the TRANSPOSE result.
+  /// The continuation just swaps the iteration space before
+  /// invoking continuation for the argument.
+  CC genTransposeProcRef(const Fortran::evaluate::ProcedureRef &procRef) {
+    assert(procRef.arguments().size() == 1 &&
+           "TRANSPOSE must have one argument.");
+    const auto *argExpr = procRef.arguments()[0].value().UnwrapExpr();
+    assert(argExpr);
+
+    llvm::SmallVector<mlir::Value> savedDestShape = destShape;
+    assert((destShape.empty() || destShape.size() == 2) &&
+           "TRANSPOSE destination must have rank 2.");
+
+    if (!savedDestShape.empty())
+      std::swap(destShape[0], destShape[1]);
+
+    PushSemantics(ConstituentSemantics::RefTransparent);
+    llvm::SmallVector<CC> operands{genElementalArgument(*argExpr)};
+
+    if (!savedDestShape.empty()) {
+      // If destShape was set before transpose lowering, then
+      // restore it. Otherwise, ...
+      destShape = savedDestShape;
+    } else if (!destShape.empty()) {
+      // ... if destShape has been set from the argument lowering,
+      // then reverse it.
+      assert(destShape.size() == 2 &&
+             "TRANSPOSE destination must have rank 2.");
+      std::swap(destShape[0], destShape[1]);
+    }
+
+    return [=](IterSpace iters) {
+      assert(iters.iterVec().size() == 2 &&
+             "TRANSPOSE expects 2D iterations space.");
+      IterationSpace newIters(iters, {iters.iterValue(1), iters.iterValue(0)});
+      return operands.front()(newIters);
+    };
+  }
+
   /// Generate a procedure reference. This code is shared for both functions and
   /// subroutines, the difference being reflected by `retTy`.
   CC genProcRef(const Fortran::evaluate::ProcedureRef &procRef,
                 llvm::Optional<mlir::Type> retTy) {
     mlir::Location loc = getLoc();

+    if (isOptimizableTranspose(procRef))
+      return genTransposeProcRef(procRef);
+
     if (procRef.IsElemental()) {
       if (const Fortran::evaluate::SpecificIntrinsic *intrin =
               procRef.proc().GetSpecificIntrinsic()) {
         // All elemental intrinsic functions are pure and cannot modify their
         // arguments. The only elemental subroutine, MVBITS has an Intent(inout)
         // argument. So for this last one, loops must be in element order
         // according to 15.8.3 p1.
         if (!retTy)

flang/test/Lower/Intrinsics/transpose.f90

-! RUN: bbc -emit-fir %s -o - | FileCheck %s
+! RUN: bbc -emit-fir %s -opt-transpose=false -o - | FileCheck %s

 ! CHECK-LABEL: func @_QPtranspose_test(
 ! CHECK-SAME: %[[source:.*]]: !fir.ref<!fir.array<2x3xf32>>{{.*}}) {
 subroutine transpose_test(mat)
 ! CHECK: %[[resultDescr:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?x?xf32>>>
    real :: mat(2,3)
    call bar_transpose_test(transpose(mat))
 ! CHECK: %[[sourceBox:.*]] = fir.embox %[[source]]({{.*}}) : (!fir.ref<!fir.array<2x3xf32>>, !fir.shape<2>) -> !fir.box<!fir.array<2x3xf32>>

flang/test/Lower/Intrinsics/transpose_opt.f90

This file was added.

				! RUN: bbc -emit-fir %s -opt-transpose=true -o - | FileCheck %s

				! CHECK-LABEL: func.func @_QPtranspose_test(
				! CHECK-SAME: %[[VAL_0:.*]]: !fir.ref<!fir.array<2x3xf32>> {fir.bindc_name = "mat"}) {
				subroutine transpose_test(mat)
				real :: mat(2,3)
				call bar_transpose_test(transpose(mat))
				! CHECK: %[[VAL_1:.*]] = arith.constant 2 : index
				! CHECK: %[[VAL_2:.*]] = arith.constant 3 : index
				! CHECK: %[[VAL_3:.*]] = arith.constant 3 : index
				! CHECK: %[[VAL_4:.*]] = arith.constant 2 : index
				! CHECK: %[[VAL_5:.*]] = fir.shape %[[VAL_1]], %[[VAL_2]] : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_6:.*]] = fir.array_load %[[VAL_0]](%[[VAL_5]]) : (!fir.ref<!fir.array<2x3xf32>>, !fir.shape<2>) -> !fir.array<2x3xf32>
				! CHECK: %[[VAL_7:.*]] = fir.allocmem !fir.array<3x2xf32>
				! CHECK: %[[VAL_8:.*]] = fir.shape %[[VAL_3]], %[[VAL_4]] : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_9:.*]] = fir.array_load %[[VAL_7]](%[[VAL_8]]) : (!fir.heap<!fir.array<3x2xf32>>, !fir.shape<2>) -> !fir.array<3x2xf32>
				! CHECK: %[[VAL_10:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_11:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_12:.*]] = arith.subi %[[VAL_3]], %[[VAL_10]] : index
				! CHECK: %[[VAL_13:.*]] = arith.subi %[[VAL_4]], %[[VAL_10]] : index
				! CHECK: %[[VAL_14:.*]] = fir.do_loop %[[VAL_15:.*]] = %[[VAL_11]] to %[[VAL_13]] step %[[VAL_10]] unordered iter_args(%[[VAL_16:.*]] = %[[VAL_9]]) -> (!fir.array<3x2xf32>) {
				! CHECK: %[[VAL_17:.*]] = fir.do_loop %[[VAL_18:.*]] = %[[VAL_11]] to %[[VAL_12]] step %[[VAL_10]] unordered iter_args(%[[VAL_19:.*]] = %[[VAL_16]]) -> (!fir.array<3x2xf32>) {
				! CHECK: %[[VAL_20:.*]] = fir.array_fetch %[[VAL_6]], %[[VAL_15]], %[[VAL_18]] : (!fir.array<2x3xf32>, index, index) -> f32
				! CHECK: %[[VAL_21:.*]] = fir.array_update %[[VAL_19]], %[[VAL_20]], %[[VAL_18]], %[[VAL_15]] : (!fir.array<3x2xf32>, f32, index, index) -> !fir.array<3x2xf32>
				! CHECK: fir.result %[[VAL_21]] : !fir.array<3x2xf32>
				! CHECK: }
				! CHECK: fir.result %[[VAL_22:.*]] : !fir.array<3x2xf32>
				! CHECK: }
				! CHECK: fir.array_merge_store %[[VAL_9]], %[[VAL_23:.*]] to %[[VAL_7]] : !fir.array<3x2xf32>, !fir.array<3x2xf32>, !fir.heap<!fir.array<3x2xf32>>
				! CHECK: %[[VAL_24:.*]] = fir.convert %[[VAL_7]] : (!fir.heap<!fir.array<3x2xf32>>) -> !fir.ref<!fir.array<3x2xf32>>
				! CHECK: fir.call @_QPbar_transpose_test(%[[VAL_24]]) : (!fir.ref<!fir.array<3x2xf32>>) -> ()
				! CHECK: fir.freemem %[[VAL_7]] : !fir.heap<!fir.array<3x2xf32>>
				! CHECK: return
				! CHECK: }
				end subroutine

				! CHECK-LABEL: func.func @_QPtranspose_allocatable_test(
				! CHECK-SAME: %[[VAL_0:.*]]: !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xf32>>>> {fir.bindc_name = "mat"}) {
				subroutine transpose_allocatable_test(mat)
				real, allocatable :: mat(:,:)
				mat = transpose(mat)
				! CHECK: %[[VAL_1:.*]] = fir.load %[[VAL_0]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xf32>>>>
				! CHECK: %[[VAL_2:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_3:.*]]:3 = fir.box_dims %[[VAL_1]], %[[VAL_2]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>, index) -> (index, index, index)
				! CHECK: %[[VAL_4:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_5:.*]]:3 = fir.box_dims %[[VAL_1]], %[[VAL_4]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>, index) -> (index, index, index)
				! CHECK: %[[VAL_6:.*]] = fir.box_addr %[[VAL_1]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>) -> !fir.heap<!fir.array<?x?xf32>>
				! CHECK: %[[VAL_7:.*]] = fir.shape_shift %[[VAL_3]]#0, %[[VAL_3]]#1, %[[VAL_5]]#0, %[[VAL_5]]#1 : (index, index, index, index) -> !fir.shapeshift<2>
				! CHECK: %[[VAL_8:.*]] = fir.array_load %[[VAL_6]](%[[VAL_7]]) : (!fir.heap<!fir.array<?x?xf32>>, !fir.shapeshift<2>) -> !fir.array<?x?xf32>
				! CHECK: %[[VAL_9:.*]] = fir.load %[[VAL_0]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xf32>>>>
				! CHECK: %[[VAL_10:.*]] = fir.box_addr %[[VAL_9]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>) -> !fir.heap<!fir.array<?x?xf32>>
				! CHECK: %[[VAL_11:.*]] = fir.convert %[[VAL_10]] : (!fir.heap<!fir.array<?x?xf32>>) -> i64
				! CHECK: %[[VAL_12:.*]] = arith.constant 0 : i64
				! CHECK: %[[VAL_13:.*]] = arith.cmpi ne, %[[VAL_11]], %[[VAL_12]] : i64
				! CHECK: %[[VAL_14:.*]]:2 = fir.if %[[VAL_13]] -> (i1, !fir.heap<!fir.array<?x?xf32>>) {
				! CHECK: %[[VAL_15:.*]] = arith.constant false
				! CHECK: %[[VAL_16:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_17:.*]]:3 = fir.box_dims %[[VAL_9]], %[[VAL_16]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>, index) -> (index, index, index)
				! CHECK: %[[VAL_18:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_19:.*]]:3 = fir.box_dims %[[VAL_9]], %[[VAL_18]] : (!fir.box<!fir.heap<!fir.array<?x?xf32>>>, index) -> (index, index, index)
				! CHECK: %[[VAL_20:.*]] = arith.cmpi ne, %[[VAL_17]]#1, %[[VAL_5]]#1 : index
				! CHECK: %[[VAL_21:.*]] = arith.select %[[VAL_20]], %[[VAL_20]], %[[VAL_15]] : i1
				! CHECK: %[[VAL_22:.*]] = arith.cmpi ne, %[[VAL_19]]#1, %[[VAL_3]]#1 : index
				! CHECK: %[[VAL_23:.*]] = arith.select %[[VAL_22]], %[[VAL_22]], %[[VAL_21]] : i1
				! CHECK: %[[VAL_24:.*]] = fir.if %[[VAL_23]] -> (!fir.heap<!fir.array<?x?xf32>>) {
				! CHECK: %[[VAL_25:.*]] = fir.allocmem !fir.array<?x?xf32>, %[[VAL_5]]#1, %[[VAL_3]]#1 {uniq_name = ".auto.alloc"}
				! CHECK: %[[VAL_26:.*]] = fir.shape %[[VAL_5]]#1, %[[VAL_3]]#1 : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_27:.*]] = fir.array_load %[[VAL_25]](%[[VAL_26]]) : (!fir.heap<!fir.array<?x?xf32>>, !fir.shape<2>) -> !fir.array<?x?xf32>
				! CHECK: %[[VAL_28:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_29:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_30:.*]] = arith.subi %[[VAL_5]]#1, %[[VAL_28]] : index
				! CHECK: %[[VAL_31:.*]] = arith.subi %[[VAL_3]]#1, %[[VAL_28]] : index
				! CHECK: %[[VAL_32:.*]] = fir.do_loop %[[VAL_33:.*]] = %[[VAL_29]] to %[[VAL_31]] step %[[VAL_28]] unordered iter_args(%[[VAL_34:.*]] = %[[VAL_27]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_35:.*]] = fir.do_loop %[[VAL_36:.*]] = %[[VAL_29]] to %[[VAL_30]] step %[[VAL_28]] unordered iter_args(%[[VAL_37:.*]] = %[[VAL_34]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_38:.*]] = fir.array_fetch %[[VAL_8]], %[[VAL_33]], %[[VAL_36]] : (!fir.array<?x?xf32>, index, index) -> f32
				! CHECK: %[[VAL_39:.*]] = fir.array_update %[[VAL_37]], %[[VAL_38]], %[[VAL_36]], %[[VAL_33]] : (!fir.array<?x?xf32>, f32, index, index) -> !fir.array<?x?xf32>
				! CHECK: fir.result %[[VAL_39]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.result %[[VAL_40:.*]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.array_merge_store %[[VAL_27]], %[[VAL_41:.*]] to %[[VAL_25]] : !fir.array<?x?xf32>, !fir.array<?x?xf32>, !fir.heap<!fir.array<?x?xf32>>
				! CHECK: fir.result %[[VAL_25]] : !fir.heap<!fir.array<?x?xf32>>
				! CHECK: } else {
				! CHECK: %[[VAL_42:.*]] = fir.shape %[[VAL_5]]#1, %[[VAL_3]]#1 : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_43:.*]] = fir.array_load %[[VAL_10]](%[[VAL_42]]) : (!fir.heap<!fir.array<?x?xf32>>, !fir.shape<2>) -> !fir.array<?x?xf32>
				! CHECK: %[[VAL_44:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_45:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_46:.*]] = arith.subi %[[VAL_5]]#1, %[[VAL_44]] : index
				! CHECK: %[[VAL_47:.*]] = arith.subi %[[VAL_3]]#1, %[[VAL_44]] : index
				! CHECK: %[[VAL_48:.*]] = fir.do_loop %[[VAL_49:.*]] = %[[VAL_45]] to %[[VAL_47]] step %[[VAL_44]] unordered iter_args(%[[VAL_50:.*]] = %[[VAL_43]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_51:.*]] = fir.do_loop %[[VAL_52:.*]] = %[[VAL_45]] to %[[VAL_46]] step %[[VAL_44]] unordered iter_args(%[[VAL_53:.*]] = %[[VAL_50]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_54:.*]] = fir.array_fetch %[[VAL_8]], %[[VAL_49]], %[[VAL_52]] : (!fir.array<?x?xf32>, index, index) -> f32
				! CHECK: %[[VAL_55:.*]] = fir.array_update %[[VAL_53]], %[[VAL_54]], %[[VAL_52]], %[[VAL_49]] : (!fir.array<?x?xf32>, f32, index, index) -> !fir.array<?x?xf32>
				! CHECK: fir.result %[[VAL_55]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.result %[[VAL_56:.*]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.array_merge_store %[[VAL_43]], %[[VAL_57:.*]] to %[[VAL_10]] : !fir.array<?x?xf32>, !fir.array<?x?xf32>, !fir.heap<!fir.array<?x?xf32>>
				! CHECK: fir.result %[[VAL_10]] : !fir.heap<!fir.array<?x?xf32>>
				! CHECK: }
				! CHECK: fir.result %[[VAL_23]], %[[VAL_58:.*]] : i1, !fir.heap<!fir.array<?x?xf32>>
				! CHECK: } else {
				! CHECK: %[[VAL_59:.*]] = arith.constant true
				! CHECK: %[[VAL_60:.*]] = fir.allocmem !fir.array<?x?xf32>, %[[VAL_5]]#1, %[[VAL_3]]#1 {uniq_name = ".auto.alloc"}
				! CHECK: %[[VAL_61:.*]] = fir.shape %[[VAL_5]]#1, %[[VAL_3]]#1 : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_62:.*]] = fir.array_load %[[VAL_60]](%[[VAL_61]]) : (!fir.heap<!fir.array<?x?xf32>>, !fir.shape<2>) -> !fir.array<?x?xf32>
				! CHECK: %[[VAL_63:.*]] = arith.constant 1 : index
				! CHECK: %[[VAL_64:.*]] = arith.constant 0 : index
				! CHECK: %[[VAL_65:.*]] = arith.subi %[[VAL_5]]#1, %[[VAL_63]] : index
				! CHECK: %[[VAL_66:.*]] = arith.subi %[[VAL_3]]#1, %[[VAL_63]] : index
				! CHECK: %[[VAL_67:.*]] = fir.do_loop %[[VAL_68:.*]] = %[[VAL_64]] to %[[VAL_66]] step %[[VAL_63]] unordered iter_args(%[[VAL_69:.*]] = %[[VAL_62]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_70:.*]] = fir.do_loop %[[VAL_71:.*]] = %[[VAL_64]] to %[[VAL_65]] step %[[VAL_63]] unordered iter_args(%[[VAL_72:.*]] = %[[VAL_69]]) -> (!fir.array<?x?xf32>) {
				! CHECK: %[[VAL_73:.*]] = fir.array_fetch %[[VAL_8]], %[[VAL_68]], %[[VAL_71]] : (!fir.array<?x?xf32>, index, index) -> f32
				! CHECK: %[[VAL_74:.*]] = fir.array_update %[[VAL_72]], %[[VAL_73]], %[[VAL_71]], %[[VAL_68]] : (!fir.array<?x?xf32>, f32, index, index) -> !fir.array<?x?xf32>
				! CHECK: fir.result %[[VAL_74]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.result %[[VAL_75:.*]] : !fir.array<?x?xf32>
				! CHECK: }
				! CHECK: fir.array_merge_store %[[VAL_62]], %[[VAL_76:.*]] to %[[VAL_60]] : !fir.array<?x?xf32>, !fir.array<?x?xf32>, !fir.heap<!fir.array<?x?xf32>>
				! CHECK: fir.result %[[VAL_59]], %[[VAL_60]] : i1, !fir.heap<!fir.array<?x?xf32>>
				! CHECK: }
				! CHECK: fir.if %[[VAL_77:.*]]#0 {
				! CHECK: fir.if %[[VAL_13]] {
				! CHECK: fir.freemem %[[VAL_10]] : !fir.heap<!fir.array<?x?xf32>>
				! CHECK: }
				! CHECK: %[[VAL_78:.*]] = fir.shape %[[VAL_5]]#1, %[[VAL_3]]#1 : (index, index) -> !fir.shape<2>
				! CHECK: %[[VAL_79:.*]] = fir.embox %[[VAL_77]]#1(%[[VAL_78]]) : (!fir.heap<!fir.array<?x?xf32>>, !fir.shape<2>) -> !fir.box<!fir.heap<!fir.array<?x?xf32>>>
				! CHECK: fir.store %[[VAL_79]] to %[[VAL_0]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?x?xf32>>>>
				! CHECK: }
				! CHECK: return
				! CHECK: }
				end subroutine

				! CHECK: func.func private @_QPbar_transpose_test(!fir.ref<!fir.array<3x2xf32>>)