This is an archive of the discontinued LLVM Phabricator instance.

[LICM] - Add option to force thread model single
ClosedPublic

Authored by gsocshubham on Jul 25 2022, 1:54 AM.

Details

Summary

Load is hoisted out of loop while store is sinked if -

a. ThreadModelSingle is set using flag -licm-force-thread-model-single
b. Thread Model is set Single from Clang

This patch resolves - https://github.com/llvm/llvm-project/issues/50537.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
gsocshubham added inline comments.Aug 2 2022, 1:51 AM
llvm/lib/Transforms/Scalar/LICM.cpp
1856

Removed.

2115

I am not sure about writing to a constant. Can you elaborate with an example where it would fail?

llvm/test/Transforms/LICM/reg-promote.ll
18

I used mem2reg to remove allocas from test but it changes the state of program which will prohibit the optimization which I am proposing.

188

Removed.

Can we make this work off the existing clang option -mthread-model=single?

Use ThreadModel::Single instead of custom flag.

Can we make this work off the existing clang option -mthread-model=single?

Done.

I have removed the flag. Now, if ThreadModel::Single is set, LICM will hoist load.

fhahn added inline comments.Aug 3 2022, 9:44 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

This doesn't like quite right. You are just comparing the enum value, not the actual configured ThreadModel I think. You likely have to check the value returned by getThreadModel.

xbolva00 added inline comments.Aug 3 2022, 9:50 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

and no tests?

gsocshubham added inline comments.Aug 4 2022, 7:41 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

getThreadModel() is declared here - https://github.com/llvm/llvm-project/blob/d0541b47000739c68c540170c6b9790ec1ea3b77/llvm/include/llvm/CodeGen/CommandFlags.h#L42

Defination is present in clang - https://github.com/llvm/llvm-project/blob/448adfee05b737a26dda34e7ae2cd4948760fff0/clang/include/clang/Driver/ToolChain.h#L573

virtual std::string getThreadModel() const { return "posix"; }

I am not sure on how can I use getThreadModel() in LICM.cpp. Is there any other way to know ThreadModel?

2115

I have added a test here - hoist-load-without-store.ll where load is hoisted out of loop.

I did not get what do you mean by no tests?

xbolva00 added inline comments.Aug 4 2022, 8:04 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

So maybe clang can run llvm with

-mllvm -your-new-flag-to-allow-data-races

if threadmodel is single?

xbolva00 added inline comments.Aug 4 2022, 8:06 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

Tests that we perform this optimization only when threadmodel is single

hiraditya added inline comments.Aug 5 2022, 7:57 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

Yeah, we need test checks the optimization was performed when threadmodel is single, and does not do the optimization otherwise.

Can we make this work off the existing clang option -mthread-model=single?

-mthread-model single does not work for aarch64, sparc and I get below error without any change in the code -

clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target

Target: aarch64-unknown-linux-gnueabihf
Thread model: posix
  1. Since, mthread-model single is not supported for most of the architecture, can we keep AllowDataRaces as a separate flag independant from mthread-model?

I have compiled below testcase with options -

../install-aarch64/bin/clang -mthread-model single thread.cpp -S -Ofast
clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target

-------thread.cpp----------

int u, v, restrict, i;

void f(int a[restrict], int b[restrict], int n) {
    for (i = 0; i < n; ++i) {
        if (a[i]) {
            ++u;
            break;
        }
        ++u;
        if (b[i])
            ++v;
    }
}

Also, I tried with SPARC -

../install-sparc/bin/clang -mthread-model single thread.cpp -S -Ofast
clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target

while for thread model posix, above commands works fine.

gsocshubham added inline comments.Aug 9 2022, 2:31 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115
  1. Passing -mllvm -your-new-flag-to-allow-data-races to clang will work with my original submission - in which there is a new flag AllowDataRaces in LICM.
  1. I have tried an approach to let LICM know whether thread model is single or not as below -

a. In clang/lib/CodeGen/BackendUtil.cpp, we have thread model info using LangOpts.getThreadModel()

b. I pass a bool variable to PassBuilder which in turns pass to LICM.

c. LICM based on this bool variable knows whether to hoist the load from the loop if thread model is single or not. I did a trail approach by referring to AllowSpeculation variable which is being passes to LICM. I don't think it is optimal/suggested solution.

From my other comment, first I want to get clear with -mthread-model single option. Can you you comments on it?

gsocshubham added inline comments.Aug 9 2022, 3:42 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

If we pass thread model information as above, then we don't need flag AllowDataRaces at all. But I think there should be more efficient way apart from above - on how clang driver pass information to llvm passes.

@greened - Can you give suggestions on above comments?

I have compiled below testcase with options -

../install-aarch64/bin/clang -mthread-model single thread.cpp -S -Ofast
clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target

-------thread.cpp----------

int u, v, restrict, i;

void f(int a[restrict], int b[restrict], int n) {
    for (i = 0; i < n; ++i) {
        if (a[i]) {
            ++u;
            break;
        }
        ++u;
        if (b[i])
            ++v;
    }
}

Also, I tried with SPARC -

../install-sparc/bin/clang -mthread-model single thread.cpp -S -Ofast
clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target

while for thread model posix, above commands works fine.

Check for examples here: test/Driver/thread-model.c

And list of architectures supporting single thread model in https://clang.llvm.org/doxygen/ToolChain_8cpp_source.html#l00729

gsocshubham edited the summary of this revision. (Show Details)

Fix review comments

  1. Updated patch with updated test LICM/promote-sink-store.ll - shows load hoisting and store sinking if flag is set.
  1. Added a test LICM/without-allow-data-race.ll which prohibits load hoisting and store sinking if the flag is not set.
  1. AllowDataRaces flag can be passed directly to clang -

../install/bin/clang -Ofast -S -promote-sink-store.ll emit-llvm -target aarch64-linux -mllvm -allow-data-races

where IR is obtained from below C testcase -

int u, v;

void f(int a[restrict], int b[restrict], int n) {
    for (int i = 0; i < n; ++i) {
        if (a[i]) {
            ++u;
            break;
        }
        ++u;
        if (b[i])
            ++v;
    }
}

TODO -

  1. Remove flag -allow-data-races but how feasible it would be to pass thread model information from clang to LICM pass via PassBuilder?
gsocshubham added inline comments.Aug 12 2022, 2:43 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2115

@hiraditya - I have added test LICM/without-allow-data-race.ll which shows that optimization is not done when flag is not set - just for confirmation purpose.

I think that it will be redundant test compared to LICM/promote-sink-store.ll and I should probably remove it once patch is ready to commit.

Sink store if architecture supports thread model single.

  1. If thread model single is set, then store is sinked.
  2. For those architecture which do not support -mthread-model single, they could make use of command line argument -thread-model-single

Let me know suggestions on whether to keep (2) approach or not.

Now for arm-linux when compiled with -Ofast -

int u, v;

void f(int a[restrict], int b[restrict], int n) {
    for (int i = 0; i < n; ++i) {
        if (a[i]) {
            ++u;
            break;
        }
        ++u;
        if (b[i])
            ++v;
    }
}

we get -

target triple = "armv4t-unknown-linux"

@u = dso_local local_unnamed_addr global i32 0, align 4
@v = dso_local local_unnamed_addr global i32 0, align 4

; Function Attrs: nofree norecurse nosync nounwind
define dso_local arm_aapcscc void @f(ptr noalias nocapture noundef readonly %0, ptr noalias nocapture noundef readonly %1, i32 noundef %2) local_unnamed_addr #0 {
  %4 = load i32, ptr @u, align 4, !tbaa !5
  %5 = load i32, ptr @v, align 4, !tbaa !5
  %6 = icmp sgt i32 %2, 0
  br i1 %6, label %7, label %27

7:                                                ; preds = %3
  %8 = add i32 %4, %2
  br label %9

9:                                                ; preds = %7, %18
  %10 = phi i32 [ %25, %18 ], [ 0, %7 ]
  %11 = phi i32 [ %19, %18 ], [ %4, %7 ]
  %12 = phi i32 [ %24, %18 ], [ %5, %7 ]
  %13 = getelementptr inbounds i32, ptr %0, i32 %10
  %14 = load i32, ptr %13, align 4, !tbaa !5
  %15 = icmp eq i32 %14, 0
  br i1 %15, label %18, label %16

16:                                               ; preds = %9
  store i32 %12, ptr @v, align 4, !tbaa !5
  %17 = add nsw i32 %11, 1
  store i32 %17, ptr @u, align 4, !tbaa !5
  br label %30

18:                                               ; preds = %9
  %19 = add nsw i32 %11, 1
  %20 = getelementptr inbounds i32, ptr %1, i32 %10
  %21 = load i32, ptr %20, align 4, !tbaa !5
  %22 = icmp ne i32 %21, 0
  %23 = zext i1 %22 to i32
  %24 = add nsw i32 %12, %23
  %25 = add nuw nsw i32 %10, 1
  %26 = icmp eq i32 %25, %2
  br i1 %26, label %27, label %9, !llvm.loop !9

27:                                               ; preds = %18, %3
  %28 = phi i32 [ %5, %3 ], [ %24, %18 ]
  %29 = phi i32 [ %4, %3 ], [ %8, %18 ]
  store i32 %29, ptr @u, align 4, !tbaa !5
  store i32 %28, ptr @v, align 4, !tbaa !5
  br label %30

30:                                               ; preds = %27, %16
  ret void
}

Note that - store to u, v both got sinked out of loop.

store i32 %29, ptr @u, align 4, !tbaa !5
store i32 %28, ptr @v, align 4, !tbaa !5

What do you think about passing thread model information from clang to LICM?

a. I had to change lot many files just to pass a single bool variable - which seems not feasible.
b. I could not think of any other approach apart from above. Let me know if someone has a better approach.
c. Is there no way to know thread model from the optimizations? Is it just present in the clang frontend? Are there any other phases which uses thread model info?

Allen added a subscriber: Allen.Aug 12 2022, 6:08 AM

The clang changes are sort of right, in the sense that it's getting value of the right flag to the right place, but it's not how we pass around that sort of information.

Usually, we do one of two things:

  • Treat it as a property of the target, and add a method to TargetTransformInfo to retrieve it from the target.
  • Encode the information directly in the IR, as a function attribute or something like that.
llvm/lib/Transforms/Scalar/LICM.cpp
113

Since this is specifically for LICM, maybe call it licm-force-thread-model-single

Change thread model single licm flag name to licm-force-thread-model-single from thread-model-single.

gsocshubham added a comment.EditedAug 16 2022, 6:09 AM

The clang changes are sort of right, in the sense that it's getting value of the right flag to the right place, but it's not how we pass around that sort of information.

Usually, we do one of two things:

  • Treat it as a property of the target, and add a method to TargetTransformInfo to retrieve it from the target.

@efriedma - ThreadModel is available in TargetOptions, how can i make it visible in TargetTransformInfo?

  1. I do not see any usage of ThreadModel in CodeGen/Transforms/Target. I see it in only in the front end.
  2. The only way I can think of is to pass information from clang to CodeGen and then store it in TargetTransformInfo and then use it in LICM. Can you point me to an example which is already present which does it? It would be helpful.

Clang (exist here) -> CodeGen (how to store here) -> Transform(want to use here)

In above patch, I passed ThreadModel from Clang to Transform but passing from Clang to CodeGen seems correct. Can you point me to such example?

  • Encode the information directly in the IR, as a function attribute or something like that.

The second approach does not seem feasible in this case as the information might be lost till the time it reaches LICM. I will go with (1).

lkail added a subscriber: lkail.Aug 16 2022, 6:32 AM

The ThreadModel is passed from clang->LLVM; see llvm/include/llvm/Target/TargetOptions.h

Most interesting parts of TargetTransformInfo are implemented in llvm/include/llvm/CodeGen/BasicTTIImpl.h . In there, you should be able to go from TargetLowering->TargetMachine->TargetOptions, or something like that.

The ThreadModel is passed from clang->LLVM; see llvm/include/llvm/Target/TargetOptions.h

Most interesting parts of TargetTransformInfo are implemented in llvm/include/llvm/CodeGen/BasicTTIImpl.h . In there, you should be able to go from TargetLowering->TargetMachine->TargetOptions, or something like that.

Makes sense.

One can get TargetOptions using - TargetLowering->getTargetMachine()->DefaultOptions->ThreadModel where DefaultOptions is TargetOptions only if TargetLowering is available in LICM.

  1. Getting TargetLowering or TargetMachine in LICM does not seem pretty straight forward/directly available in LICM which implies there seems no way to know Thread Model in LICM.

-> In LICM, we have TargetTransformInfo and TargetLibraryInfo but again it does not fetch TargetLowering/TargetOptions in any way.

  1. The current implementation seems complete as per https://github.com/llvm/llvm-project/issues/50537 in which there is option -licm-force-thread-model-single if the target supports thread model single.
  1. We pass thread model from clang to licm which enables this transformation when thread model is single.

@efriedma - WDYT?

If we can get TargetLowering/TargetMachine in LICM then we do not need to pass thread model from clang to LICM. But I do not see any optimization making use of TargetLowering/TargetMachine as it is unavailable. Do you have any suggestions on it?

TargetTransformInfo can access TargetLowering. Set getTLI() in BasicTTIImpl.h.

Fix review comments -

  1. Rename LICM flag to -licm-force-thread-model-single
  1. Use thread model info from TargetOptions instead of passing from clang to llvm

TargetTransformInfo can access TargetLowering. Set getTLI() in BasicTTIImpl.h.

Understood. I have updated it.

I have fixed all the suggestions. WDYT? @eli.friedman

gsocshubham added inline comments.Aug 22 2022, 1:51 AM
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
112 ↗(On Diff #454415)

What should be the return value here? Should it be ThreadModel::INVALID or should I keep it same?

Should I add a new value to existing enum like below?

namespace ThreadModel {
  enum Model {
    INVALID // Thread model not known?
    POSIX,  // POSIX Threads
    Single  // Single Threaded Environment
  };
}

Above enum is defined in llvm/Target/TargetOptions.h

tschuett removed a subscriber: tschuett.Aug 22 2022, 1:53 AM

Use Options instead of DefaultOptions while fetching thread model.

I think that instead of "ThreadModel::Model getThreadModel()" as the TTI API, we should just make the API something like "bool isSingleThreaded()". Which is basically, can any other thread run at the "same" time. I don't really want to include TargetOptions.h everywhere, and it makes the reason we're exposing it on TargetTransformInfo a bit more clear.

I don't think you ever addressed my comment about constants. The case where that would come up as a practical issue is if you have an argument marked "deferenceable"; we know it's legal to load from an arbitrary dereferenceable pointer, but it's not legal to store to one. We can only store to locations we know are modifiable (allocas, non-constant global variables, etc.)

gsocshubham edited the summary of this revision. (Show Details)
  1. Change ThreadModel::Model getThreadModel() to bool isSingleThreaded()
  1. Remove inclusion of TargetOptions at various places.
  1. Prohibit this transformation if store points to constant memory using bool PointToConstantMemory.

I think that instead of "ThreadModel::Model getThreadModel()" as the TTI API, we should just make the API something like "bool isSingleThreaded()". Which is basically, can any other thread run at the "same" time. I don't really want to include TargetOptions.h everywhere, and it makes the reason we're exposing it on TargetTransformInfo a bit more clear.

Understood. Updated as above.
Thanks!

I don't think you ever addressed my comment about constants. The case where that would come up as a practical issue is if you have an argument marked "deferenceable"; we know it's legal to load from an arbitrary dereferenceable pointer, but it's not legal to store to one. We can only store to locations we know are modifiable (allocas, non-constant global variables, etc.)

I have added a check to avoid this transformation if store points to constant memory. Let me know WDYT on it. Have I addressed all your comments now? @eli.friedman

Prohibit this transformation if store points to constant memory using bool PointToConstantMemory.

You can't use pointsToConstantMemory to prove the property you need. The problem is that it doesn't do what you want if it can't prove anything. You need a function that says the memory isn't writable unless it can prove the memory is writable (maybe call it something like "pointsToWritableMemory").

Getting the proof requirements wrong is a common trap for newcomers to writing compiler optimizations; please watch out for it in the future.

nikic added a comment.Sep 1 2022, 12:26 AM

Prohibit this transformation if store points to constant memory using bool PointToConstantMemory.

You can't use pointsToConstantMemory to prove the property you need. The problem is that it doesn't do what you want if it can't prove anything. You need a function that says the memory isn't writable unless it can prove the memory is writable (maybe call it something like "pointsToWritableMemory").

Getting the proof requirements wrong is a common trap for newcomers to writing compiler optimizations; please watch out for it in the future.

In practice, would that be alloca and non-constant globals, or can we handle any other cases as well? I don't think doing this for noalias calls would be legal, because the noalias semantics don't say anything about the memory being writable, right?

llvm/test/Transforms/LICM/promote-sink-store.ll
1 ↗(On Diff #456922)

Please run the tests through -instnamer so the instructions have names. Also, it should be able to significantly reduce this test, it currently contains code not relevant to the transform.

You'll also want to add a negative test for the constant memory case.

nikic added a comment.Sep 1 2022, 6:40 AM

In practice, would that be alloca and non-constant globals, or can we handle any other cases as well? I don't think doing this for noalias calls would be legal, because the noalias semantics don't say anything about the memory being writable, right?

Hm ... we already assume that this is fine for noalias calls (in the non-capturing case), so I guess that's a pre-existing issue that is orthogonal to this patch. I believe the right treatment here would be to only omit the isNotCapturedBeforeOrInLoop() check in the single-thread case, but leave everything else alone.

Prohibit this transformation if store points to constant memory using bool PointToConstantMemory.

You can't use pointsToConstantMemory to prove the property you need. The problem is that it doesn't do what you want if it can't prove anything. You need a function that says the memory isn't writable unless it can prove the memory is writable (maybe call it something like "pointsToWritableMemory").

Understood. Should I add isNotCapturedBeforeOrInLoop() as an extra check as mentioned by @nikic ?

Getting the proof requirements wrong is a common trap for newcomers to writing compiler optimizations; please watch out for it in the future.

Noted. Thanks!

gsocshubham added inline comments.Sep 2 2022, 3:15 AM
llvm/test/Transforms/LICM/promote-sink-store.ll
1 ↗(On Diff #456922)

Sure. I will update it in the next push once we finalize the check on whether to use isNotCapturedBeforeOrInLoop() or should we write a custom function pointsToWritableMemory() to check alloca and non-constant globals?

gsocshubham marked an inline comment as done and an inline comment as not done.Sep 2 2022, 10:00 AM

Prohibit this transformation if store points to constant memory using bool PointToConstantMemory.

You can't use pointsToConstantMemory to prove the property you need. The problem is that it doesn't do what you want if it can't prove anything. You need a function that says the memory isn't writable unless it can prove the memory is writable (maybe call it something like "pointsToWritableMemory").

Understood. Should I add isNotCapturedBeforeOrInLoop() as an extra check as mentioned by @nikic ?

First off, please rebase your patch on main; @nikic made changes in rG315aef66.

Anyway, with that patch, the relevant bit of code currently looks like this:

Value *Object = getUnderlyingObject(SomePtr);
SafeToInsertStore =
    (isNoAliasCall(Object) || isa<AllocaInst>(Object) ||
     (isa<Argument>(Object) && cast<Argument>(Object)->hasByValAttr())) &&
    isNotCapturedBeforeOrInLoop(Object, CurLoop, DT);

There are basically two checks here: one, that the location is readable and writable (it's a malloc/alloca/byval argument/etc,), and two, that the location isn't accessed by some other thread (isNotCapturedBeforeOrInLoop).

If you know there aren't any other threads, you can just skip the isNotCapturedBeforeOrInLoop() check, and you're basically done.

On top of that, you probably also want to expand the list of locations that are known to be readable and writable to include global variables. (The current code doesn't check for them because the isNotCapturedBeforeOrInLoop() check can't succeed for global variables.)

nikic added a comment.Sep 5 2022, 1:40 AM

I pushed another change that should make this code clearer: https://github.com/llvm/llvm-project/commit/388b684354cc71bd4043ddccbcfd91fb338d8b1e In single-thread mode, the isThreadLocalObject() check can be skipped. isWritableObject() can be extended to handle constant globals.

Address review comments.

Run -instnamer on lit tests.

TODO - Add a negative test for the constant memory case.

I am thinking on what changes needs to be done on below testcase to become it as negative test -

int u, v;

void f(int a[restrict], int b[restrict], int n) {

for (int i = 0; i < n; ++i) {
    if (a[i]) {
        ++u;
        break;
    }
    ++u;
    if (b[i])
        ++v;
}

}

I had rebased with master while pushing changes. But while pushing, some other patch got merged which causes conflict.

Let me rebase it again and push it again.

gsocshubham added inline comments.Sep 8 2022, 1:32 AM
llvm/test/Transforms/LICM/promote-sink-store.ll
1 ↗(On Diff #456922)

@nikic - Can you please point me to a testcase for constant memory case?

Add global variable check for sinking store.

Add global variable check for sinking store.

@eli.friedman @nikic - Can you please give review on my latest changes?

Update SafeToInsertStore check.

Fix build bot failure caused by change in clang-format version.

Combine thread model single check with global variable check.

@eli.friedman and @nikic - Any thoughts on above changes?

nikic added inline comments.Sep 10 2022, 2:33 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2111

The check for GlobalVariable should go inside isWritableObject() and needs to check !isConstant().

xbolva00 added inline comments.Sep 10 2022, 3:11 AM
llvm/lib/Transforms/Scalar/LICM.cpp
114

You should rename this variable.

gsocshubham added inline comments.Sep 10 2022, 4:00 AM
llvm/lib/Transforms/Scalar/LICM.cpp
114

Do you have any suggestion? Probably, ForceThreadModelSingle?

Fix all review comments -

  1. Move global variable constant check inside IsWritableObject()
  1. Rename flag ThreadModelSingle to SingleThread
llvm/lib/Transforms/Scalar/LICM.cpp
114

I have renamed flag to SingleThread.

@xbolva00 - Let me know WDYT

nikic added inline comments.Sep 10 2022, 8:52 AM
llvm/lib/Transforms/Scalar/LICM.cpp
2110–2111

These checks should be in isThreadLocalObject(). If there is only a single thread, then the object is always thread-local. Your current code incorrectly skips the writability check.

Move checks (TTI->isSingleThreaded() || SingleThread) inside bool isThreadLocalObject()

gsocshubham marked an inline comment as done.Sep 11 2022, 12:36 AM
gsocshubham added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
2110–2111

Done. Thanks!

Please check it.

gsocshubham marked an inline comment as done.Sep 12 2022, 3:58 AM
gsocshubham added a subscriber: dmgreen.

Is this patch ready to be merged?

@nikic @dmgreen @eli.friedman

nikic added a comment.Sep 12 2022, 5:32 AM

I believe the logic is now correct, but testing could be improved. Some suggestions:

  • Use a single file with two RUN lines, only one of which has -licm-force-thread-model-single. Combined with --check-prefixes, this will make it clear in which cases behavior differs based on the flag and in which cases it is the same.
  • The current test cases look too complicated. For globals, I believe a very simple loop that just loads / stores the global inside the loop should be enough. Without the flag this will lead to only load promotion, and with it to store promotion.
  • We should have a variant of that test with a constant global (should have no store promotion, independent of the flag).
  • We should have a variant of the test with a function argument (unknown if writable, so should have no store promotion).
  • We should have a variant of the test with an alloca that is captured (store promotion legal with the flag, because capture does not impact thread safety).
llvm/lib/Transforms/Scalar/LICM.cpp
1908

Unused AAResults argument?

Add unit tests.

TODO - Remove attributes from test cases.

  1. Remove attributes from LIT tests.
  1. Run --instnamer on LIT tests.
  1. Rename testcases as below -

promote-sink-store-arg.ll
-> No store promotion

promote-sink-store-capture.ll
-> Store promotion

promote-sink-store-constant-global.ll
-> No store promotion

promote-sink-store-global.ll
-> Store promotion

@nikic @efriedma - Can you please give review on latest LIT tests? Is there any other way to remove alloca apart from mem2reg? Using mem2reg changes state of the program where I want to show current optimization.

I will rebase patch and update it since failure is due to ongoing patches in master.

Rebase with master branch.

gsocshubham marked an inline comment as done.Sep 28 2022, 7:12 AM
gsocshubham added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
1908

Removed.

gsocshubham marked an inline comment as done.Sep 28 2022, 7:56 AM
gsocshubham added inline comments.
llvm/test/Transforms/LICM/promote-sink-store.ll
3 ↗(On Diff #462391)

FTMS stands for Force Thread Model Single
NFTMS stands for No Force Thread Model Single

gsocshubham updated this revision to Diff 463654.EditedSep 28 2022, 12:51 PM

Address all the review comments.

  1. Add 2 RUN lines in LIT tests with prefixes - LICM-TMS and LICM-NO-TMS where TMS stands for Thread Model Single
  1. Add a minimal test for globals where store promotion will happen only if flag is set and load promotion happens irrespective of the flag.

-> promote-sink-store-global.ll

  1. Add a minimal test for constant globals where store promotion will not happen irrespective of the flag.

-> promote-sink-store-constant-global.ll

  1. Add a minimal test for function argument where store promotion will not happen.

-> promote-sink-store-arg.ll

  1. Add a minimal test for alloca that is captured where store promotion will happen.

-> promote-sink-store-capture.ll

  1. At last there is a testcase named - "promote-sink-store.ll" which is the testcase attached in original bug description from github - https://github.com/llvm/llvm-project/issues/50537
gsocshubham added inline comments.Sep 28 2022, 12:56 PM
llvm/test/Transforms/LICM/promote-sink-store-arg.ll
73 ↗(On Diff #463654)

This IR is obtained from below C code -

void f(int n, int u) {
    for (int i = 0; i < n; ++i) {
            u = i;
    }
}
llvm/test/Transforms/LICM/promote-sink-store-constant-global.ll
77 ↗(On Diff #463654)

This IR is obtained from C testcase -

const int u = 7;

void f(int n) {
     int x;
    for (int i = 0; i < n; ++i) {
        x = u;
    }
}
llvm/test/Transforms/LICM/promote-sink-store-global.ll
85 ↗(On Diff #463654)

This IR is obtained from C code -

int u, v;

void f(int n) {
    int a;
    for (int i = 0; i < n; ++i) {
            u = i;
            a = v;
    }
}
llvm/test/Transforms/LICM/promote-sink-store.ll
91 ↗(On Diff #463654)

This IR is obtained from below testcase attached from bug description -

int u, v;

void f(int a[restrict], int b[restrict], int n) {
    for (int i = 0; i < n; ++i) {
        if (a[i]) {
            ++u;
            break;
        }
        ++u;
        if (b[i])
            ++v;
    }
}

I believe the logic is now correct, but testing could be improved. Some suggestions:

  • Use a single file with two RUN lines, only one of which has -licm-force-thread-model-single. Combined with --check-prefixes, this will make it clear in which cases behavior differs based on the flag and in which cases it is the same.

Added 2 RUN lines in all the LIT tests.

  • The current test cases look too complicated. For globals, I believe a very simple loop that just loads / stores the global inside the loop should be enough. Without the flag this will lead to only load promotion, and with it to store promotion.

I have used a simple loop to achieve above.

  • We should have a variant of that test with a constant global (should have no store promotion, independent of the flag).

Done.

  • We should have a variant of the test with a function argument (unknown if writable, so should have no store promotion).

Done.

  • We should have a variant of the test with an alloca that is captured (store promotion legal with the flag, because capture does not impact thread safety).

Done.

Review request - @momchil.velikov @dmgreen @eli.friedman @nikic @fhahn

Can someone give final review? If this patch is fine, is it ready to merge?

All review comments have been addressed with no regression.

Use below testcase for capture case - test/Transforms/LICM/promote-sink-store-capture.ll

void f(int n, int u) {
    for (int i = 0, x = u; i < n; ++i) {
    x = i;
    }
}

instead of -

void f(int a[restrict], int n, int u) {
    for (int i = 0; i < n; ++i) {
    int x = u;
        if (a[i]) {
            ++x;
            break;
        }
        ++x;
    }
}
gsocshubham added inline comments.Sep 29 2022, 12:32 AM
llvm/test/Transforms/LICM/promote-sink-store-capture.ll
83 ↗(On Diff #463767)

C testcase which is used to generate below IR -

void f(int n, int u) {
    for (int i = 0, x = u; i < n; ++i) {
    x = i;
    }
}

Note - For capture case, store promotion happens irrespective of flag -licm-force-thread-model-single i.e. with just -licm with current state of LICM.

In my last to last revision, without that flag captured store promotion was not happening, maybe some other patches in between have caused it.

nikic added a comment.Sep 29 2022, 8:17 AM

Can you please run the test cases through -sroa (as in, commit the test case with SROA applied, not add it to the RUN line)? The current tests contain unnecessary alloca+load/store+lifetime.start/end.

llvm/lib/Transforms/Scalar/LICM.cpp
115

Update description to match new option name?

arsenm added a subscriber: arsenm.Sep 29 2022, 8:44 AM
arsenm added inline comments.
llvm/test/Transforms/LICM/promote-sink-store-arg.ll
98 ↗(On Diff #463767)

How is the threadedness relevant in a case that doesn't use atomic loads and stores?

gsocshubham retitled this revision from [LICM] - Add option to allow data races to [LICM] - Add option to force thread model single.

Address review comments -

  1. Run -sroa on lit tests where-ever applicable.
  1. Update LICM flag description.
  1. Add a new LIT test "promote-sink-atomic-store-arg.ll" to show atomic store behaviour.

I am able to run -sroa successfully on promote-sink-store-global.ll

I could not run sroa on below LIT tests as everything is getting optimized and there will be no load/store left to showcase this optmization effect - (Please check below godbolt links)

a. promote-sink-store-constant-global.ll - Constant global case - https://clang.godbolt.org/z/n5E69ebbz

IR obtained from -

const int u = 7;

int f(int n) {
     int x, i;
    for (i = 0; i < n; ++i) {
        x = u;
    }
  return x + u;
}

b. promote-sink-store-capture.ll - Capture case - https://clang.godbolt.org/z/fjxn7xc46

IR obtained from -

void f(int n, int u) {
    for (int i = 0, x = u; i < n; ++i) {
    x = u;
    }
}

c. promote-sink-store-arg.ll - Function argument case - https://clang.godbolt.org/z/d85qhxbsr

IR obtained from -

void f(int n, int u) {
    for (int i = 0; i < n; ++i) {
            u = i;
    }
}
gsocshubham added inline comments.Sep 30 2022, 3:50 AM
llvm/lib/Transforms/Scalar/LICM.cpp
115

Done.

llvm/test/Transforms/LICM/promote-sink-atomic-store-arg.ll
73 ↗(On Diff #464209)

This IR is obtained from below C testcase -

void f(int n, int u) {
    for (int i = 0; i < n; ++i) {
    __atomic_store_n( &u, i, __ATOMIC_SEQ_CST );
    }
}
llvm/test/Transforms/LICM/promote-sink-store-arg.ll
98 ↗(On Diff #463767)

I have added a new test named - promote-sink-atomic-store-arg.ll below which contains atomic store.

Let me know WDYT of the new test.

gsocshubham added inline comments.Sep 30 2022, 4:00 AM
llvm/test/Transforms/LICM/promote-sink-store-global.ll
85 ↗(On Diff #463654)

Now the latest IR is being generated using below C code followed by -sroa

int u, v;

int f(int n) {
    int a;
    for (int i = 0; i < n; ++i) {
            u = i;
            a = v;
    }
  return u + a;
}

@nikic - Does the patch look fine now? Is it ready to merge?

Let me know suggestions if any. Thanks for your earlier -sroa comment.

@nikic - Does the patch look fine now? Is it ready to merge?

Let me know suggestions if any. Thanks for your earlier -sroa comment.

PING.

Can you please run the test cases through -sroa (as in, commit the test case with SROA applied, not add it to the RUN line)? The current tests contain unnecessary alloca+load/store+lifetime.start/end.

I have run -sroa and shared respective dumps.

Please let me know if you need any other changes in this current patch.

llvm/lib/Transforms/Scalar/LICM.cpp
115

@nikic - Updated the description.

llvm/test/Transforms/LICM/promote-sink-store-arg.ll
98 ↗(On Diff #463767)

@arsenm - Any thoughts on above?

gsocshubham added a comment.EditedOct 6 2022, 3:58 AM

I believe the logic is now correct, but testing could be improved. Some suggestions:

  • Use a single file with two RUN lines, only one of which has -licm-force-thread-model-single. Combined with --check-prefixes, this will make it clear in which cases behavior differs based on the flag and in which cases it is the same.
  • The current test cases look too complicated. For globals, I believe a very simple loop that just loads / stores the global inside the loop should be enough. Without the flag this will lead to only load promotion, and with it to store promotion.
  • We should have a variant of that test with a constant global (should have no store promotion, independent of the flag).
  • We should have a variant of the test with a function argument (unknown if writable, so should have no store promotion).
  • We should have a variant of the test with an alloca that is captured (store promotion legal with the flag, because capture does not impact thread safety).

Should I keep both prefix checks with and without flag? That is increasing number of lines of code in the testcase.

After reducing the testcase with bare minumum load/store licm opt - Below are number of lines

promote-sink-atomic-store-arg.ll - 30 lines
promote-sink-store-arg.ll - 30 lines
promote-sink-store-capture.ll - 40 lines
promote-sink-store-constant-global.ll - 20 lines
promote-sink-store-global.ll - 20 lines
promote-sink-store.ll - 35 lines

+ checks generated by llvm/utils
+ these includes spaces in between

gsocshubham updated this revision to Diff 466013.EditedOct 7 2022, 2:22 AM

Add simple LIT tests -

  1. Remove complex testcases and add simple ones without allocas as requested by @nikic
  2. Run -instnamer and -sroa.
  3. Two checks with and without flag -licm-force-thread-model-single is shown only when behaviour differs when flag is set. (For reduced line in LIT tests). TMS stands for Thread Model Single

Note - There is alloca present in capture case as requested from above comment by @nikic We should have a variant of the test with an alloca that is captured (store promotion legal with the flag, because capture does not impact thread safety).

Behaviour of test cases -

constant-global-sroa-instnamer.ll
-> We can not store to a constant global and hence same behaviour irrespective of flag

capture-instnamer.ll
-> Store promotion happens.

arg-sroa-instnamer.ll
-> No store promotion irrespective of flag.

promote-sink-store-global.ll
-> Store promotion happens if flag is set.

gsocshubham added inline comments.Oct 7 2022, 2:29 AM
llvm/test/Transforms/LICM/promote-sink-store-arg.ll
73 ↗(On Diff #463654)

Updated test IR is obtained after running -sroad and -instnamer on -

int i, n;

void f(int u[n]) {
    for (i = 0; i < n; ++i) {
          if(u[i])
            u[i] = i;
    }
}

Here store to u will be not promoted.

llvm/test/Transforms/LICM/promote-sink-store-capture.ll
83 ↗(On Diff #463767)

Updated test IR is obtained after running -sroad and -instnamer on -

int i;

void f(int u) {
    i = 0;
    for (int x = u; i < 20; ++i) {
    x = i;
    }
}

Here store to x will be promoted.

llvm/test/Transforms/LICM/promote-sink-store-constant-global.ll
77 ↗(On Diff #463654)

Updated test IR is obtained after running -sroad and -instnamer on -

const int u;

int f(int n) {
     int x, i;
    for (i = 0; i < n; ++i) {
        x = u;
    }
  return x + u;
}
nikic added a comment.Oct 7 2022, 8:17 AM

I'm still not really happy with the minimality of these tests. They still contain many unnecessary parts. Generally, unless you are writing debuginfo tests, it is better to write test IR by hand instead of trying to generate it using clang.

I've pushed a set of minimal tests at https://github.com/llvm/llvm-project/commit/b6676f3c12588cd1333de9bb3cee3a53bc71771e. Can you please rebase over those tests, and then rerun update_test_checks.py with these adjusted RUN lines?

; RUN: opt -S -licm < %s | FileCheck %s --check-prefixes=CHECK,MT
; RUN: opt -S -licm -licm-force-thread-model-single < %s | FileCheck %s --check-prefixes=CHECK,ST

Update RUN lines with flag -licm-force-thread-model-single and rerun update_test_checks.py on promote-single-thread.ll

Remove previously submitted LIT tests.

I'm still not really happy with the minimality of these tests. They still contain many unnecessary parts. Generally, unless you are writing debuginfo tests, it is better to write test IR by hand instead of trying to generate it using clang.

Understood!

I've pushed a set of minimal tests at https://github.com/llvm/llvm-project/commit/b6676f3c12588cd1333de9bb3cee3a53bc71771e. Can you please rebase over those tests, and then rerun update_test_checks.py with these adjusted RUN lines?

; RUN: opt -S -licm < %s | FileCheck %s --check-prefixes=CHECK,MT
; RUN: opt -S -licm -licm-force-thread-model-single < %s | FileCheck %s --check-prefixes=CHECK,ST

@nikic - Thanks a lot for above tests. I have rebased and updated it accordingly.

Now, store in promote_global() and promote_captured_alloca() gets promoted as per your comments in the testcase. Is this patch good to merge now?

nikic accepted this revision.Oct 7 2022, 10:00 AM

LGTM

This revision is now accepted and ready to land.Oct 7 2022, 10:00 AM

LGTM

Thanks. Can you please commit it for me with below details? I do not have commit access.

Name - Shubham Narlawar
Email - shubham.narlawar@rrlogic.co.in

"Shubham Narlawar <shubham.narlawar@rrlogic.co.in>"

nikic added a comment.Oct 7 2022, 2:26 PM

Sure, I'll land this on Monday if nobody beats me to it.

arsenm added a comment.Oct 7 2022, 6:49 PM

How is this different from using syncscope("singlethread")?

nikic added a comment.Oct 8 2022, 1:06 AM

How is this different from using syncscope("singlethread")?

They don't have any relation, syncscope is a property of atomic operations.

To remove all the LICM-specific aspects, the question here is basically whether it is safe to introduce a spurious (non-atomic) store of the form store (load p), p without introducing a data race.

Sure, I'll land this on Monday if nobody beats me to it.

Sounds good.

It seems like everyone is fine with it!

This revision was automatically updated to reflect the committed changes.

@nikic - Thanks for comitting it.

Thanks for your help!

We see that for a single-threaded system or with option '-mllvm -licm-force-thread-model-single', the generated code quality worsens for a very simple example: https://godbolt.org/z/5xhY1EcGb.

Would it be possible to fix this?

nikic added a comment.Mar 9 2023, 8:03 AM

We see that for a single-threaded system or with option '-mllvm -licm-force-thread-model-single', the generated code quality worsens for a very simple example: https://godbolt.org/z/5xhY1EcGb.

Would it be possible to fix this?

This is not really a regression at the IR level. Ideally the loop would be removed entirely after indvars loop exit replacement, but it currently doesn't support this pattern where the value from the next-to-last iteration is taken. There's a couple open issues for that, but somehow I can't find a single one of them...

nikic added a comment.Mar 9 2023, 8:22 AM

Ah, I found one of the issues I remembered: https://github.com/llvm/llvm-project/issues/51396

Yes, it looks like the similar conditions are triggered by the code shown in that issue, but perhaps the reason isn't that indvars simplification pass couldn't really deduce the the value from last but one iteration (as seen in the example I shared: https://godbolt.org/z/5xhY1EcGb), but maybe the code structure introduced by this change?

The indvarsimplify did work fine in the absence of the option '-licm-force-thread-model-single'. With this option, the store is sunk into exit block in the first pass of LICM itself (just before loop-rotation). This then presents a different loop body to loop-rotation, which doesn't result in a guard condition. In the end, indvarsimplification see the exit block depending upon not just the induction variable value, but also on the original value of 'theGlobalVar', which perhaps make it difficult to reduce this to just an exit val computation.

As opposed to this, the case where it works (without the LICM option), the indvar simplification sees the loop as invariant i.e. the exit values just depends upon only the values from within loop, which then perhaps helps in reducing the loop to just an exit value computation.

In other words, with this LICM change, the store, which otherwise was getting skipped is now always done, making the code in the exit block to choose between the two values to store, the original value of global or the one less than exit value from the loop. The original value of the global location is propagated through loop making it more complicated than required? So, perhaps this is not required, but instead the code can be transformed such that the choice of values can be directly done in the exit block (e.g. a select between original value or computed exit val based on loop condition, similar to loop-rotation's guard condition)? This will leave the loop only with induction variable computations and thus can still be just reduced to calculation of exit values.
However, I do understand that this might out of scope of this change as it didn't introduce the transformation, but just exploited it.