This is an archive of the discontinued LLVM Phabricator instance.

[flang] Keep original data type for do-variable value.
ClosedPublic

Authored by vzakhari on Aug 18 2022, 2:31 PM.

Details

Summary

Keep the original data type of integer do-variables
for structured loops. When do-variable's data type
is an integer type shorter than IndexType, processing
the do-variable separately from the DoLoop's iteration index
allows getting rid of type casts, which can make backend
optimizations easier.

For example,

do i = 2, n-1
  do j = 2, n-1
    ... = a(j-1, i)
  end do
end do

If value of 'j' is computed by casting the DoLoop's iteration
index to 'i32', then Flang will produce the following LLVM IR:

%1 = trunc i64 %iter_index to i32
%2 = sub i32 %1, 1
%3 = sext i32 %2 to i64

LLVM's InstCombine may try to get rid of the sign extension,
and may transform this into:

%1 = shl i64 %iter_index, 32
%2 = add i64 %1, -4294967296
%3 = ashr exact i64 %2, 32

The extra computations for the element address applied on top
of this awkward pattern confuse LLVM vectorizer so that
it does not recognize the unit-strided access of 'a'.

Measured performance improvements on SPEC CPU2000@IceLake:

168.wupwise:    11.96%
171.swim:       11.22%
172.mrgid:      56.38%
178.galgel:      7.29%
301.apsi:        8.32%

Diff Detail

Event Timeline

vzakhari created this revision.Aug 18 2022, 2:31 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2022, 2:31 PM
vzakhari requested review of this revision.Aug 18 2022, 2:31 PM
vzakhari updated this revision to Diff 453841.Aug 18 2022, 5:22 PM

This should fix the Windows build failure.

clementval added inline comments.Aug 18 2022, 11:36 PM
flang/lib/Lower/Bridge.cpp
173

Any good reason to have 64 inlined elements here?

1248

IndexType's bit width is meant to be target specific so I guess it is normal you cannot get the information at this stage.

vzakhari added inline comments.Aug 19 2022, 8:57 AM
flang/lib/Lower/Bridge.cpp
173

No particular reason except to keep the default vector size of 64. Without specifying 64 the compilation asserts that IncrementLoopInfo is smaller than 256 bytes, otherwise, it is considered to be too big. When the vector size is explicitly specified the assertion is ignored regardless of the size.

1248

Yep, this sounds right.

clementval added inline comments.Aug 19 2022, 11:04 AM
flang/lib/Lower/Bridge.cpp
173

This is not specifying the size of the vector but the number of inlined elements.

vzakhari added inline comments.Aug 19 2022, 11:06 AM
flang/lib/Lower/Bridge.cpp
173

Right, I did not use the right words :)

clementval accepted this revision.Aug 19 2022, 11:50 AM

LGTM

flang/lib/Lower/Bridge.cpp
173

Ok then fine for me.

This revision is now accepted and ready to land.Aug 19 2022, 11:50 AM

Please wait on this until Jean has had a chance to review it.

This approach will only affect user loops. It won't affect compiler generated loops.

If a motivation for using an index type is to accommodate affine code optimization, there might be some alternatives. It might be possible to generate the loop with the original type, convert it to an index type before an affine pass, and then convert it back. Or some such. If something like that works, it might be cleaner.

flang/lib/Lower/Bridge.cpp
99

// Do loop block argument holding the current value

1246–1248

The width of an integer type is intTy.getWidth(). Doesn't that apply here?

vdonaldson added inline comments.
flang/lib/Lower/Bridge.cpp
173

Isn't a loop nest depth of 64 massive overkill? 4 would cover the vast majority of use cases.

The code originally had a lot of explicit llvm::SmallVector sizes, and someone (mlir folks?) suggested it was better to not do that. Why is this case different than other cases?

Please wait on this until Jean has had a chance to review it.

This approach will only affect user loops. It won't affect compiler generated loops.

This is correct. At the same time, the compiler generated loops mean to traverse all elements of some array, and since the extents have IndexType we have to use IndexType loops in many cases. These loops may use shorter IV's when the extents are known to fit a shorter integer type, but I am not sure how easy it is to recognize such cases in the current lowering scheme.
This change only tries to preserve what is written in the user code explicitly, i.e. if user uses INTEGER(4) IV, then it will stay in i32.

If a motivation for using an index type is to accommodate affine code optimization, there might be some alternatives. It might be possible to generate the loop with the original type, convert it to an index type before an affine pass, and then convert it back. Or some such. If something like that works, it might be cleaner.

This may require changing DoLoop definition that only supports IndexType index currently (just like scf::ForOp), but this is possible. I suppose the LLVM IR dialect lowering of such a DoLoop will have to compute the iteration count in i64 anyway to handle loops with more iterations than signed i32 can represent.

I am not sure how easy it will be to "remember" a loop's original indexing value type between affine promotion/demotion since affine transformations can delete loops (e.g. loop fusion) and create new ones (e.g. loop distribution).

I will wait for Jean's comments.

Thanks for review!

flang/lib/Lower/Bridge.cpp
99

Will change. Thanks!

1246–1248

IndexType is not an IntegerType and it does not provide means for getting its width, so there is no way to compare widths of IntegerType and IndexType.

vzakhari added inline comments.Aug 19 2022, 12:35 PM
flang/lib/Lower/Bridge.cpp
173

I guess it is not different from other cases. To select the right constant we need to know the average number of elements for all programs. I do not know it in this case, so 64, 4 and even 1 seem the same to me :) In the end we are just balancing between stack usage and heap allocation overhead.

vzakhari updated this revision to Diff 454082.Aug 19 2022, 12:47 PM

Fixed comment, and changed pre-allocated vector size to 8.

vdonaldson added inline comments.Aug 19 2022, 5:50 PM
flang/lib/Lower/Bridge.cpp
1250

A loop that hasRealControl is unstructured, so this half of the condition is a nop.

There is one alternative to your fix at the flang level that I see, but I am not sure it is legal/ok to go that route:

From a Fortran point of view, would it be allowed to use the index (i64) induction variable directly when the Fortran loop variable appears inside the loops (in places where its value is needed, and it does not need to be passed by address) If it were allowed, it would be possible to avoid generating the casts and to add an extra induction variable. But I am not sure this is legal, so your approach might just be the best workaround to this LLVM optimization issue.

To illustrate, I am wondering if it would it be legal to lower:

 do j = 2, n-1
    ... = a(j-42)
end do

into:

 fir.do_loop %induct = %lb to %ub step %step-> index {
  %induct_cast = fir.convert %induct : (index) -> i32
  fir.store %induct_cast to %induct_cast : !fir.ref<i32>
  ! Using %induct directly inside the loop
  %add = arith.subi %induct, %c42 : index
  %addr = fir.coordinate_of %a_base, %add
}

That way, the vectorizer would probably work and the generated code would be even cleaner. However, this code would behave differently in case of integer overflow, and while overflow in indexing is probably always bad (I am not sure what Fortran or flang guarantees here), I am not sure about the cases where the induction variable would appear in a general integer expression.

There is one alternative to your fix at the flang level that I see, but I am not sure it is legal/ok to go that route:

From a Fortran point of view, would it be allowed to use the index (i64) induction variable directly when the Fortran loop variable appears inside the loops (in places where its value is needed, and it does not need to be passed by address) If it were allowed, it would be possible to avoid generating the casts and to add an extra induction variable. But I am not sure this is legal, so your approach might just be the best workaround to this LLVM optimization issue.

To illustrate, I am wondering if it would it be legal to lower:

 do j = 2, n-1
    ... = a(j-42)
end do

into:

 fir.do_loop %induct = %lb to %ub step %step-> index {
  %induct_cast = fir.convert %induct : (index) -> i32
  fir.store %induct_cast to %induct_cast : !fir.ref<i32>
  ! Using %induct directly inside the loop
  %add = arith.subi %induct, %c42 : index
  %addr = fir.coordinate_of %a_base, %add
}

That way, the vectorizer would probably work and the generated code would be even cleaner. However, this code would behave differently in case of integer overflow, and while overflow in indexing is probably always bad (I am not sure what Fortran or flang guarantees here), I am not sure about the cases where the induction variable would appear in a general integer expression.

I think we do not care about the integer overflow, because a Fortran program relying on particular overflow behavior is non-conformant. At the same time, I am not sure how easy it is to extend the whole computation tree(s) dependent on the induction variable from their original data type to index. It does not sound trivial to do during the lowering.

jeanPerier accepted this revision.Aug 22 2022, 2:27 PM

I think we do not care about the integer overflow, because a Fortran program relying on particular overflow behavior is non-conformant. At the same time, I am not sure how easy it is to extend the whole computation tree(s) dependent on the induction variable from their original data type to index. It does not sound trivial to do during the lowering.

Fair point, this would complexify lowering of expressions that expects getting the same operand types. Your fix is more localized, so I think it is best to go that way for now.

vzakhari updated this revision to Diff 454623.Aug 22 2022, 2:57 PM

Removed the check for hasRealControl. Thanks, Val!

vzakhari marked an inline comment as done.Aug 22 2022, 2:57 PM
vzakhari updated this revision to Diff 454654.Aug 22 2022, 5:03 PM

Did some code restructuring after discussing it with Val.

vdonaldson accepted this revision.Aug 22 2022, 5:42 PM

Thanks Slava

This revision was automatically updated to reflect the committed changes.
vikm added a subscriber: vikm.Jun 2 2023, 3:57 AM