This is an archive of the discontinued LLVM Phabricator instance.

[flang] Generate TBAA information.
ClosedPublic

Authored by vzakhari on Jan 15 2023, 11:20 PM.

Details

Summary

This is the initial version of TBAA information generation for
Flang-generated IR. The desired behavior is that TBAA type descriptors
are generated for FIR types during the FIR-to-LLVM type conversion,
and TBAA access tags are then attached to memory-accessing operations
when they are converted to the LLVM IR dialect.

In this initial version the type conversion does not yet produce
per-type TBAA descriptors; instead, all memory accesses are partitioned
into two sets, box and non-box accesses, which can never alias each other.
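
For illustration, here is a minimal C++ sketch of that box/non-box split using LLVM's MDBuilder API. The patch itself builds the equivalent metadata in TBAABuilder during conversion to the LLVM IR dialect; the function and the root/descriptor names below are placeholders, not what is actually emitted.

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

using namespace llvm;

// Two sibling type descriptors under one root: accesses tagged with one
// descriptor can never alias accesses tagged with the other.
void tagAccesses(LLVMContext &Ctx, LoadInst *BoxMemberLoad,
                 StoreInst *DataStore) {
  MDBuilder MDB(Ctx);
  MDNode *Root = MDB.createTBAARoot("flang TBAA root (placeholder)");
  MDNode *BoxDesc = MDB.createTBAAScalarTypeNode("descriptor member", Root);
  MDNode *DataDesc = MDB.createTBAAScalarTypeNode("any data access", Root);

  BoxMemberLoad->setMetadata(LLVMContext::MD_tbaa,
                             MDB.createTBAAStructTagNode(BoxDesc, BoxDesc, 0));
  DataStore->setMetadata(LLVMContext::MD_tbaa,
                         MDB.createTBAAStructTagNode(DataDesc, DataDesc, 0));
}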

TBAA generation is enabled by default at optimization levels above -O0.
It may also be enabled via the apply-tbaa option of the
fir-to-llvm-ir conversion pass. The -mllvm -disable-tbaa engineering
option allows disabling TBAA generation to override Flang's default
(e.g. when -O1 is used).

SPEC CPU2006/437.leslie3d speeds up by more than 2x on Icelake.

Depends on D141726

Diff Detail

Event Timeline

vzakhari created this revision. Jan 15 2023, 11:20 PM
vzakhari requested review of this revision. Jan 15 2023, 11:20 PM
jeanPerier accepted this revision. Jan 16 2023, 7:50 AM

I am not knowledgeable on TBAA, so I cannot assess whether the metadata is created correctly, but the explanation and the code adding the attributes in codegen look great. Nice to see it has a measurable impact on performance too! (Something is wrong with the Windows bot build, but I think you may just need to rebase; many other patches had similar Windows failures today.)

This revision is now accepted and ready to land. Jan 16 2023, 7:50 AM

Thank you for the review, Jean! Yes, it looks like the Windows buildbot has been failing for a couple of days, but it should be okay after 7bf1e441da6b59a25495fde8e34939f93548cc6d. I will upload the rebased files shortly.

flang/lib/Optimizer/CodeGen/CodeGen.cpp
1634

I believe it should be sourceBoxType instead of boxTy here and in the next call, because these calls generate accesses of the sourceBox. I noticed this during the rebase and fixed it.

clementval accepted this revision. Jan 17 2023, 6:32 AM
clementval added inline comments.
flang/lib/Optimizer/CodeGen/CodeGen.cpp
1634

Yeah sourceBoxType sounds right here.

A few nits.

flang/lib/Optimizer/CodeGen/TBAABuilder.cpp
39

Should read "In the usual"

flang/lib/Optimizer/CodeGen/TBAABuilder.h
24

Should read "type conversion and to attach llvm.tbaa attributes to memory access"

43

"parent" rather than "parrent"

47

Should read "whether an operation"

103

Should read "has to provide"

109

Did you mean to say "access's"?

152

Should read "in an incorrect result" and "in the scope"

157

Should read "under the opt-in"

172

Should read "Attach the llvm.tbaa"

vzakhari marked 7 inline comments as done. Jan 17 2023, 8:42 AM

Thank you for the comments, @PeteSteinfeld! I will upload the updated files shortly.

flang/lib/Optimizer/CodeGen/TBAABuilder.h
109

I did, but I thought the final s had to be dropped. Thanks!

157

I would prefer "under an opt-in", because the option is not defined.

vzakhari updated this revision to Diff 489845. Jan 17 2023, 8:43 AM
  • Fixed the wording.
This revision was automatically updated to reflect the committed changes.
peixin added a subscriber: peixin. Mar 25 2023, 1:38 AM

Hi @vzakhari, are you still working on box/type descriptors such as type_desc_4/5/6? Or do you plan to work on that later? I am studying some optimizations which may benefit from the complete tbaa.

Hi @peixin, I am not working on this. Can you please share the cases where you think it might be profitable to distinguish boxes with different numbers of dimensions? Is it from real applications? We can discuss it over email; that seems more convenient.

@SBallantyne also ran into some cases where additional tbaa information is useful. It is not clear yet whether this involves tbaa for dummy arguments as well. If we can discuss this on Discourse, that would be great, since we can also participate or stay informed.

@vzakhari Yes, it's from one real application, and the hot loop comes from one specific input. The loop looks like the following, where a is an assumed-shape array and b is an explicit-shape array. To perform the loop interchange optimization, one problem is that it is hard to understand the loop structure from the SCEV of the a(i,k) access (see the sketch after the loop below). I am wondering whether the tbaa lower bound, upper bound and stride would help.

DO j=Jstart, Jend
  DO i=Istart, Iend
    DO k=1, N
      a(i,k)=0.5_8*(b(i-1,j,k) + b(i  ,j,k))
    END DO
  END DO
END DO
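
As an illustration of why the loop structure is opaque to SCEV here, below is a hypothetical C++ analogue of the a(i,k) address computation for an assumed-shape array; the struct layout, names, and element-count strides are invented for illustration and do not match flang's actual descriptor.

// Hypothetical stand-ins for a 2-D array descriptor ("box").
struct Dim { long lower_bound, extent, stride; };
struct Descriptor2D { double *base; Dim dim[2]; };

// Address of a(i,k): the lower bounds and strides are loaded from the
// descriptor at run time, so SCEV sees an expression built from loaded
// values rather than compile-time constants.
inline double &elem(const Descriptor2D &a, long i, long k) {
  return a.base[(i - a.dim[0].lower_bound) * a.dim[0].stride +
                (k - a.dim[1].lower_bound) * a.dim[1].stride];
}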

@kiranchandramohan @SBallantyne I am still analyzing it, and am not ready to give a clear answer on how to utilize the tbaa dimensional info. I plan to generate it by hand and proceed with the analysis.

Thanks @peixin. We were looking at some kernels from lapack. Some of the complex-type kernels show an order-of-magnitude slowdown compared to classic-flang. We see plenty of memchecks generated by the llvm vectorizer for an inner loop, but these are not present for the real-type kernels. We are not yet sure whether these are caused by a lack of alias info or by something specific to complex vectorisation in llvm.

I remember engineers from another organisation passing additional info from Classic-Flang to LLVM to make SCEV aware of descriptor info. It was a few years ago so I don't remember the details.

An alternative (as you know) would be to do the interchange in MLIR itself by promoting (if possible) to Affine. I don't know whether the present infrastructure is sufficient for it.

@kiranchandramohan Is someone working on those lapack kernels? If not, can you send me a test case and I can take a look at it? I am also investigating the SCEV and memory checks for Fortran IR (both LLVM Flang and Classic Flang) in the LV pass.

True, doing the interchange in MLIR may be easier. But I am trying to find a way that benefits both LLVM Flang and Classic Flang, so I have to work at the LLVM pass level.

@SBallantyne was investigating this. He found out that we have some downstream llvm changes that reduce the number of runtime memory checks in some strided cases. A WIP patch is available in https://reviews.llvm.org/D147542.

OK.

Hi @peixin, sorry, I do not understand how more specific tbaa for the members of the descriptor can help here. The loop above has a single "data" store, to a(i,k). The loads of the descriptor members are also inside the loop, but the current tbaa already allows disambiguating the data store from the member loads. So LLVM should be able to hoist the member loads out of the loop nest with the current tbaa, since they are loop invariant.

I suppose the load from N might be a problem for the loop interchange, since the current tbaa does not help disambiguate the load from N from the store to a(i,k) inside the outer loop. But having more precise tbaa for the descriptor members won't help here either.
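
A hypothetical C++ analogue of the hoisting argument (invented names, not flang's actual descriptor layout): without aliasing information the compiler must assume the store through the data pointer could modify the descriptor fields themselves and reload them on every iteration, whereas with the box/non-box split those fields are known loop invariant.

// Stand-in for a 1-D array descriptor ("box").
struct Descriptor {
  double *base;
  long lower_bound, extent, stride;
};

void zeroColumn(Descriptor &a, long n) {
  for (long k = 0; k < n; ++k)
    // With box and data accesses in disjoint TBAA sets, this store cannot
    // modify a.base or a.stride, so their loads can be hoisted out of the
    // loop; without that information they would be conservatively reloaded.
    a.base[k * a.stride] = 0.0;
}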

peixin added a comment. Apr 7 2023, 2:08 AM

I think you are talking about the dependence analysis in loop interchange. The loop interchange pass is not enabled by default; it heavily relies on the IR input in my experience, and its cost model was updated last year. The problem I faced is not the dependence analysis. I used SCEV to analyze the loop structure and succeeded when I worked on classic-flang, but it causes some regressions on more complex IR. So I am thinking about whether I can use the TBAA information to restrict the SCEV analysis to special cases, i.e. to restrict the SCEV analysis to Fortran array addressing only.