This is an archive of the discontinued LLVM Phabricator instance.

[WIP][LoopIdiom] Teach LNIR to versioning loops and add runtime check
Needs ReviewPublic

Authored by eopXD on Jun 21 2021, 5:38 AM.

Details

Summary

Teach LNIR to add runtime checks upon the loop nest for runtime-determined store sizes.
Versioning only the top-level loop is a trade-off between enabling this optimization
and the exponential growth of versioned loops.

LNIR will perform identically to the current LIR if the runtime check detects a
negative or too-large value.

Diff Detail

Event Timeline

eopXD created this revision.Jun 21 2021, 5:38 AM
eopXD requested review of this revision.Jun 21 2021, 5:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2021, 5:38 AM

Why is this beneficial?

llvm/test/Transforms/LoopIdiom/memset-runtime.ll
1 ↗(On Diff #353346)

Please use update script to autogenerate check lines

eopXD added a subscriber: qianzhen.Jun 21 2021, 5:50 AM

Why is this beneficial?

I shall quote from D104179 as @qianzhen stated this optimization clearly:

When the loop idiom transformation processes a memset instruction in a loop, currently it only handles the memset with a compile-time constant size. The motivation of this work is to relax this limitation, so that a memset with a variable size in a loop may still be processed and promoted to a larger memset if it passes all the eligibility checks. Performance-wise, promoting the memset in a loop to a larger memset reduces the number of calls to memset; hence reducing the overall call overhead.
A similar technique may also apply to the memcpy with a variable size in a loop.
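In C terms, the transformation the quoted summary describes looks roughly like the following sketch (function names are hypothetical; the real pass works on LLVM IR, not C source):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Before: one memset call per row -- n calls in total, with a
 * row size that is only known at run time. */
static void zero_rows_loop(char *buf, size_t n, size_t row_bytes) {
    for (size_t i = 0; i < n; ++i)
        memset(buf + i * row_bytes, 0, row_bytes);
}

/* After: the rows are contiguous, so the whole loop can be
 * promoted to a single larger memset, reducing call overhead. */
static void zero_rows_flat(char *buf, size_t n, size_t row_bytes) {
    memset(buf, 0, n * row_bytes);
}
```

Both functions clear the same bytes; the promoted form simply replaces n calls with one.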

eopXD added inline comments.Jun 21 2021, 5:55 AM
llvm/test/Transforms/LoopIdiom/memset-runtime.ll
1 ↗(On Diff #353346)

I have never auto-generated check lines before. May I ask if there is some resource for me to explore this?

Why is this beneficial?

I shall quote from D104179 as @qianzhen stated this optimization clearly:

When the loop idiom transformation processes a memset instruction in a loop, currently it only handles the memset with a compile-time constant size. The motivation of this work is to relax this limitation, so that a memset with a variable size in a loop may still be processed and promoted to a larger memset if it passes all the eligibility checks. Performance-wise, promoting the memset in a loop to a larger memset reduces the number of calls to memset; hence reducing the overall call overhead.
A similar technique may also apply to the memcpy with a variable size in a loop.

This only reiterates what this patch does, which was pretty obvious, and does not answer my question.
More concretely, can the scalar loop be expected to clear more than i64 -1 bytes of memory?
If not, why can't the byte count to be cleared be calculated by promoting to i64, and multiplying with saturation?

I am quite skeptical how useful this can be in practice. Any motivating examples from real-world code/benchmarks?
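For reference, one way to read the "promote to i64 and multiply with saturation" suggestion above is the following hypothetical C sketch (the names are made up; this is not what the patch implements):

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two unsigned 64-bit byte counts, saturating at
 * UINT64_MAX instead of wrapping on overflow. */
static uint64_t mul_sat_u64(uint64_t a, uint64_t b) {
    if (a != 0 && b > UINT64_MAX / a)
        return UINT64_MAX;
    return a * b;
}

/* Hypothetical byte-count computation: clamp negative 32-bit
 * inputs to zero, promote to 64 bits, then multiply saturating. */
static uint64_t bytes_to_clear(int32_t trip_count, int32_t store_size) {
    uint64_t t = trip_count > 0 ? (uint64_t)trip_count : 0;
    uint64_t s = store_size > 0 ? (uint64_t)store_size : 0;
    return mul_sat_u64(t, s);
}
```

With such a computation a negative or overflowing size degenerates to 0 or UINT64_MAX rather than producing a wrong memset length.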

llvm/test/Transforms/LoopIdiom/memset-runtime.ll
7 ↗(On Diff #353346)

The LoopFlatten pass could convert the two loops into one, no? Then you don't need to teach this pass about this pattern.

Btw, is this a common pattern? Can you show us some motivating examples? Some hits in benchmarks?

22 ↗(On Diff #353346)

So we have the original loops and a memset, which could be expanded to a series of stores. A nightmare for code size..

eopXD updated this revision to Diff 353584.Jun 22 2021, 2:27 AM

Make the test case more precisely target the optimization this patch provides.

eopXD added a comment.Jun 22 2021, 2:40 AM

Thank you @lebedev.ri and @xbolva00 for leaving comments.
I would follow up with benchmarks of the patch.

llvm/test/Transforms/LoopIdiom/memset-runtime.ll
7 ↗(On Diff #353346)

Yes, LoopFlatten does convert two loops into one. Sorry that my example was misleading. I have adjusted the test case so LoopFlatten is irrelevant.

In the new test case the memset size is runtime-determined and cannot be optimized by the current LoopIdiomRecognize; this patch handles that scenario.

To handle it there must be runtime checks to make sure the optimization is safe, so versioning is needed. To avoid exponential growth of versioned loops, we create only one versioned copy for the whole loop nest. That is where the LoopNestPass comes in handy. The pass runs on a LoopNest, which is constructed over the whole nested loop (it does not run on sub-loops).

22 ↗(On Diff #353346)

I think this pass does the opposite. It promotes a series of stores in a loop into a single memset.

The code-size increase from this patch happens when versioning occurs.
(The original LoopIdiomRecognize doesn't involve versioning.)

lebedev.ri requested changes to this revision.Jun 22 2021, 2:47 AM
lebedev.ri added inline comments.
llvm/test/Transforms/LoopIdiom/memset-runtime.ll
22 ↗(On Diff #353346)

The code-size increase from this patch happens when versioning occurs.

Isn't that @xbolva00's point in the first place?

This revision now requires changes to proceed.Jun 22 2021, 2:47 AM
eopXD marked 2 inline comments as not done.Jun 22 2021, 2:49 AM
eopXD added a comment.Jun 22 2021, 2:57 AM

Accidentally marked them done; I have undone it.
No bad intentions there.

llvm/test/Transforms/LoopIdiom/memset-runtime.ll
22 ↗(On Diff #353346)

I think I misunderstood the English. I originally thought "series of stores" meant multiple store instructions.

Yes, the versioning would increase code size. I think it should be up to the user whether they want this optimization turned on, so I added ForceNoLoopVersion as a compile option. If ForceNoLoopVersion = true, then no runtime checks will be added and the pass will only process constant sizes.

eopXD updated this revision to Diff 354152.Jun 23 2021, 9:20 PM

Fix some clang-format.

fhahn added inline comments.Jun 25 2021, 7:15 AM
llvm/test/Transforms/LoopIdiom/memset-runtime.ll
36 ↗(On Diff #354152)

I'm not sure if I am missing something, but this test has just a single loop, so it's a 'trivial' loop nest, right?

Couldn't this case be handled without making it a loop nest pass, by just checking if there's no parent loop? And even if there's a parent loop, wouldn't it also be possible to transform the inner loop to a wider memset, without the parents being perfectly nested?

1 ↗(On Diff #353346)

You can use the llvm/utils/update_test_checks.py script to automatically generate check lines.

eopXD updated this revision to Diff 354652.Jun 26 2021, 1:21 AM

Update the code so it runs correctly.

eopXD updated this revision to Diff 354663.Jun 26 2021, 2:59 AM
eopXD marked an inline comment as done.

Generate check lines with script.

eopXD added inline comments.Jun 26 2021, 3:28 AM
llvm/test/Transforms/LoopIdiom/memset-runtime.ll
36 ↗(On Diff #354152)

Yes, you are right. This example does not show why LoopNest is needed.

The current LIR only deals with constant store sizes (or memset sizes). This patch uses SCEV to let LIR deal with runtime sizes, and handling runtime sizes requires adding a runtime check:

if (runtime_check) {
  // optimized code
} else {
  // original code, without optimizing on runtime variables
}

With LIR, we can certainly widen the memset from the innermost loop outward until it reaches the outermost one. However, if we perform this layer by layer, the total number of versions created grows exponentially. To prevent this from happening, we create only one if-else versioning at the top-level loop. This is where LoopNest comes in.

Thank you for the reminder; I shall add an example to show this.

eopXD updated this revision to Diff 358877.EditedJul 15 2021, 2:38 AM

Address comments.

  1. Make processLoopMemset able to deal with smax expressions whose constant operands are implied by loop guards.

    How: If the operands are constant, we can fold them, and when we version we add runtime checks to make sure our folding of the smax is safe.
  1. Added a nested test case that shows the necessity of using LoopNest.

    Why: If we used the current LIR with this patch, which seeks to optimize on runtime-determined variables, the versioning would happen loop-by-loop. In other words, the pass would need to generate versioning for every loop until it hoists the memset to the outermost loop's preheader. LNIR (LoopNestIdiomRecognize) makes only one versioning outside of the outermost loop to prevent this from happening.
eopXD added a comment.Jul 15 2021, 2:49 AM

@lebedev.ri

May I ask what your main concern with this patch is?
I would be glad to discuss and resolve it.

eopXD added inline comments.Jul 15 2021, 3:02 AM
llvm/test/Transforms/LoopIdiom/memset-runtime.ll
36 ↗(On Diff #354152)

Hi @fhahn,
I have added a nested test case which shows the improvement from this patch (supporting idiom recognition for runtime-determined variables) and the necessity of the LoopNest structure (versioning at the top-level loop).
May I mark this thread as done?

Just a minor comment regarding potential benefits and benchmarking. If I understand the patch correctly, it flattens a multi-level loop with a memset into one memset. According to simple microbenchmarks (https://godbolt.org/z/MTnYcvvYo), on my x86-64 Skylake, flattening the memset in a double loop (memset_3D) into one memset gives anywhere from an ~800% performance boost (on a small working set) down to an ~80% boost (when the working set exceeds the LLC).
But how useful would such a transformation be in practice? I'm not sure. We need to keep in mind that a memset is usually just part of memory initialization/reuse code, so in real-world benchmarks flattening memset loops may be less beneficial than the microbenchmarks show.

eopXD added a comment.Jul 18 2021, 6:50 AM

gentle ping

But how useful would such a transformation be in practice? I'm not sure. We need to keep in mind that a memset is usually just part of memory initialization/reuse code, so in real-world benchmarks flattening memset loops may be less beneficial than the microbenchmarks show.

In Fortran, initialization of multi-dimensional arrays uses one liners like

A = 0.0

When A is multi-dimensional, it is converted to a loop nest in LLVM IR. For example, if A is 3-D, the LLVM IR would look like the following. This nested store pattern is common in Fortran.

for i ...
  for j ...
    for k ...
      A[i][j][k] = 0;
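For illustration, here is a C analogue of that lowering and the single memset it can collapse to. The sizes are compile-time constants only to keep the example small (the patch targets the runtime-sized case), and the memset form relies on 0.0f having the all-zero bit pattern in IEEE 754:

```c
#include <assert.h>
#include <string.h>

enum { N = 2, M = 3, O = 4 };

/* The lowered loop nest for Fortran's `A = 0.0` on a 3-D array. */
static void zero_nest(float a[N][M][O]) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < M; ++j)
            for (int k = 0; k < O; ++k)
                a[i][j][k] = 0.0f;
}

/* The array is contiguous, so the whole nest is one memset
 * (valid because IEEE-754 0.0f is the all-zero bit pattern). */
static void zero_flat(float a[N][M][O]) {
    memset(a, 0, sizeof(float) * N * M * O);
}
```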

I'm not sure we actually need to version the loop to handle overflow.

In any given address-space, there are at most 2^N bytes addressable by "gep ptr, i" where N is DataLayout::getIndexTypeSizeInBits(). Assuming the memset isn't volatile, there isn't any point to storing to an address more than once. So we only actually need to memset min(2^N, TotalBytes) bytes. You should be able to pass that number to memset.

I'm not sure we actually need to version the loop to handle overflow.

That was my comment, too, and i've yet to see a concise explanation why that isn't the case.

In any given address-space, there are at most 2^N bytes addressable by "gep ptr, i" where N is DataLayout::getIndexTypeSizeInBits(). Assuming the memset isn't volatile, there isn't any point to storing to an address more than once. So we only actually need to memset min(2^N, TotalBytes) bytes. You should be able to pass that number to memset.

eopXD added a comment.Jul 22 2021, 9:16 AM

Hi @efriedma,

I am not sure that consecutive memory cannot exceed 2^N bytes, but I cannot think of a scenario where it can. Thank you for raising this valid point.

However, in the current code inside LoopIdiomRecognize::processLoopMemset, the pass also checks for size overflow for constant sizes. I am curious why it was originally needed here. Do you have any idea why the check is needed? (The commit history is out of Phabricator's reach.)

// Reject memsets that are so large that they overflow an unsigned.
uint64_t SizeInBytes = cast<ConstantInt>(MSI->getLength())->getZExtValue();
if ((SizeInBytes >> 32) != 0)
  return false;

Even if we don't need a runtime check for size overflow, we still need runtime checks for the assumptions made to fold the SCEV expressions that enable the optimization. So versioning would still be needed.

However, in the current code inside LoopIdiomRecognize::processLoopMemset, the pass also checks for size overflow for constant sizes. I am curious why it was originally needed here. Do you have any idea why the check is needed? (The commit history is out of Phabricator's reach.)

Code originates from rG8643810eded6eef8ad2753478a8403437695228f . As for why, not sure. There isn't any obvious reason it's necessary.

Even if we don't need a runtime check for size overflow, we still need runtime checks for the assumptions made to fold the SCEV expressions that enable the optimization. So versioning would still be needed.

Why can't we use the SCEV expression for the trip count as-is?

eopXD added a comment.Jul 23 2021, 9:38 AM

Even if we don't need a runtime check for size overflow, we still need runtime checks for the assumptions made to fold the SCEV expressions that enable the optimization. So versioning would still be needed.

Why can't we use the SCEV expression for the trip count as-is?

The SCEV of the trip count may contain an smax, which makes the memset size incomparable in processLoopMemset. To conservatively fold the smax expression, a runtime check is added.

In a do-while loop like the following, the trip-count SCEV would be (1 smax (sext i32 %n to i64)). To fold the SCEV we add the runtime-check condition n >= 1, and fold the expression into (sext i32 %n to i64).

int i = 0;
do {
  ++i;
}  while (i < n);
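A small C sketch of why the smax appears: the do-while body runs at least once, so the trip count is max(1, n). The helper names here are hypothetical, added only to make the model checkable:

```c
#include <assert.h>

/* Counts the iterations of `do { ++i; } while (i < n);` with i = 0. */
static long dowhile_trip_count(int n) {
    long count = 0;
    int i = 0;
    do {
        ++i;
        ++count;
    } while (i < n);
    return count;
}

/* The SCEV model of that trip count: (1 smax n). */
static long smax1(int n) { return n > 1 ? (long)n : 1; }
```

For any n, including zero or negative values, the loop executes smax(1, n) times, which is why SCEV cannot drop the smax without knowing n >= 1.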

An example would be the 2nd test case in memset-runtime.ll.
For the inner loop, the optimized memset instruction has a MemsetSizeSCEV containing an smax expression, since NumBytesSCEV = TripCountSCEV * StoreSizeSCEV:

Calculate NumBytesSCEV = TripCountSCEV * StoreSizeSCEV
  TripCountSCEV: (1 smax (sext i32 %m to i64))
  StoreSizeSCEV: (4 * (sext i32 %o to i64))<nsw>
  NumBytesSCEV: (4 * (sext i32 %o to i64) * (1 smax (sext i32 %m to i64)))

Then in the outer loop, when the pass wants to check whether MemsetSizeSCEV == PointerStrideSCEV, we need to fold the expression to eliminate the smax so that the comparison can be performed. If the folding succeeds, the pass proceeds to perform the optimization for the outer loop.

MemsetSizeSCEV: (4 * (sext i32 %o to i64) * (1 smax (sext i32 %m to i64)))
PositiveStrideSCEV: (4 * (sext i32 %m to i64) * (sext i32 %o to i64))
Try to convert SCEV expression and compare again
  MemsetSCEVConv: (4 * (zext i32 %m to i64) * (zext i32 %o to i64))
  PositiveStrideSCEVConv: (4 * (zext i32 %m to i64) * (zext i32 %o to i64))

Yes, I guess you need to version to handle that case in particular. I'd like to avoid versioning in simpler cases, though.

I have to ask; is that case actually showing up in real code? Usually, people prefer for loops over do-while loops in situations like that. If you write an equivalent test using for loops, SCEV should simplify the backedge-taken count.

eopXD updated this revision to Diff 361419.Jul 24 2021, 1:07 AM

Address comments.

  • no need to check for overflow; the assumption is that no consecutive memory region exceeds 2^N bytes (where N is DataLayout::getIndexTypeSizeInBits()).
  • add nested for-loop to testcase
eopXD added a comment.EditedJul 24 2021, 1:08 AM

SCEV folding is still needed for the for-loop test case.

void test(int n, int m, int o, int *ar) {
  for (int i=0; i<n; ++i) {
    for (int j=0; j<m; ++j) {
      int *arr = ar + i * m * o + j * o;
      memset(arr, 0, o * sizeof(int));      
    }
  }
}

For the inner loop:

MemsetSizeSCEV: (4 * (sext i32 %o to i64))<nsw>
PositiveStrideSCEV: (4 * (sext i32 %o to i64))<nsw>
Calculate NumBytesS = TripCountS * StoreSizeSCEV
  TripCountS: (zext i32 %m to i64)
  StoreSizeSCEV: (4 * (sext i32 %o to i64))<nsw>
  NumBytesS: (4 * (zext i32 %m to i64) * (sext i32 %o to i64))

Then in the outer loop, to compare MemsetSizeSCEV with PointerStrideSCEV, the pass converts sext to zext for the comparison; this is the SCEV folding, and a runtime check is added for it.

MemsetSizeSCEV: (4 * (zext i32 %m to i64) * (sext i32 %o to i64))
PositiveStrideSCEV: (4 * (sext i32 %m to i64) * (sext i32 %o to i64))
Try to convert SCEV expression and compare again
  MemsetSCEVConv: (4 * (zext i32 %m to i64) * (zext i32 %o to i64))
  PositiveStrideSCEVConv: (4 * (zext i32 %m to i64) * (zext i32 %o to i64))
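Putting the pieces together, the versioned output for the test() function above might look like the following hypothetical C sketch. The exact guard emitted by the pass is the set of folded-SCEV assumptions; this example only approximates it with simple sign checks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static void test_versioned(int n, int m, int o, int *ar) {
    if (n >= 0 && m >= 0 && o >= 0) {
        /* optimized version: the runtime check justified folding
         * the SCEVs, so the whole nest becomes one memset */
        memset(ar, 0, sizeof(int) * (size_t)n * (size_t)m * (size_t)o);
    } else {
        /* fallback: the original loop nest, unchanged */
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                memset(ar + (ptrdiff_t)i * m * o + (ptrdiff_t)j * o, 0,
                       (size_t)o * sizeof(int));
    }
}
```

Only one if-else is generated for the whole nest, which is the point of versioning at the top-level loop rather than per loop.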

Oh, hmm. Yes, you need to do some sort of SCEV folding to handle that case, but you can perform the check at compile-time using something like ScalarEvolution::isLoopEntryGuardedByCond.

  • no need to check for overflow; the assumption is that no consecutive memory region exceeds 2^N bytes (where N is DataLayout::getIndexTypeSizeInBits()).

Maybe to be conservative, restrict this to memset in address-space zero.

  • no need to check for overflow; the assumption is that no consecutive memory region exceeds 2^N bytes (where N is DataLayout::getIndexTypeSizeInBits()).

Maybe to be conservative, restrict this to memset in address-space zero.

The reason for this is that in other address-spaces, you might need to saturate the memset amount, instead of just ignoring overflow, I think.

eopXD updated this revision to Diff 363694.Aug 3 2021, 4:41 AM
This comment was removed by eopXD.
eopXD updated this revision to Diff 363696.Aug 3 2021, 4:44 AM

Revert to previous edition.

eopXD updated this revision to Diff 363701.Aug 3 2021, 4:55 AM

Address comments.

For the SCEV expressions MemsetSize and PointerStride of a memset instruction, the pass
first folds expressions that are already guarded by the loop guard. If the
folded expressions are not equal, LoopIdiomRecognize will abort, while LoopNestIdiomRecognize
will proceed to try converting the SCEVs and add appropriate runtime checks.
If the converted expressions are equal, LNIR proceeds with the optimization and does
versioning at the outermost loop.

Added separate test-case files for LIR and LNIR.
For LIR, it can deal with runtime-determined variables as long as no runtime check is required.
For LNIR, it will do versioning and add runtime checks if needed.

High-level comment: let's not convolute things too much.
I'm honestly already at a loss with the current diff, and (to me!) the answers to the questions are not very clear.
Let's at least deal with a proper loop-idiom patch first,
and maybe follow up with the loop-nest loop-idiom changes.

eopXD added a comment.Aug 3 2021, 5:02 AM

High-level comment: let's not convolute things too much.
I'm honestly already at a loss with the current diff, and (to me!) the answers to the questions are not very clear.
Let's at least deal with a proper loop-idiom patch first,
and maybe follow up with the loop-nest loop-idiom changes.

Sure.
Let me open a new patch for the proper loop-idiom changes and come back here after that.
Thank you for following up on this patch.

HLJ2009 added a subscriber: HLJ2009.Aug 3 2021, 5:03 AM
eopXD added a comment.Aug 3 2021, 8:01 AM

Created a new patch, D107353, that deals with the proper loop idiom.
I will come back and modify this one after D107353 is accepted.

Whitney added a project: Restricted Project.Dec 1 2021, 8:27 AM
eopXD retitled this revision from [LoopIdiom] [LoopNest] let the pass deal with runtime memset size to [LoopIdiom] Teach LNIR to versioning loops and add runtime check.Dec 16 2021, 12:33 AM
eopXD edited the summary of this revision. (Show Details)

Patch description does not explain why versioning is needed in the first place.

Patch description does not explain why versioning is needed in the first place.

Yes, you are right. I will need some time to rebase and update this patch.

eopXD added a comment.EditedDec 16 2021, 12:45 AM

Also, I understand that the LNIR idea might not ultimately justify being merged upstream.
So once this patch is ready for review, maybe we can have a discussion in the LoopOpt Working Group.

eopXD retitled this revision from [LoopIdiom] Teach LNIR to versioning loops and add runtime check to [WIP][LoopIdiom] Teach LNIR to versioning loops and add runtime check.Dec 16 2021, 1:05 AM
lebedev.ri resigned from this revision.Jan 12 2023, 5:15 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 5:15 PM
Herald added a subscriber: StephenFan. · View Herald Transcript