Load is hoisted out of loop while store is sinked if -
a. ThreadModelSingle is set using flag -licm-force-thread-model-single
b. Thread Model is set Single from Clang
This patch resolves - https://github.com/llvm/llvm-project/issues/50537.
Differential D130466
[LICM] - Add option to force thread model single gsocshubham on Jul 25 2022, 1:54 AM. Authored by
Details Load is hoisted out of loop while store is sinked if - a. ThreadModelSingle is set using flag -licm-force-thread-model-single This patch resolves - https://github.com/llvm/llvm-project/issues/50537.
Diff Detail
Event TimelineThere are a very large number of changes, so older changes are hidden. Show Older Changes
Comment Actions Done. I have removed the flag. Now, if ThreadModel::Single is set, LICM will hoist load.
Comment Actions -mthread-model single does not work for aarch64, sparc and I get below error without any change in the code - clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target Target: aarch64-unknown-linux-gnueabihf Thread model: posix
Comment Actions I have compiled below testcase with options - ../install-aarch64/bin/clang -mthread-model single thread.cpp -S -Ofast clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target -------thread.cpp---------- int u, v, restrict, i; void f(int a[restrict], int b[restrict], int n) { for (i = 0; i < n; ++i) { if (a[i]) { ++u; break; } ++u; if (b[i]) ++v; } } Also, I tried with SPARC - ../install-sparc/bin/clang -mthread-model single thread.cpp -S -Ofast clang-15: error: invalid thread model 'single' in '-mthread-model single' for this target while for thread model posix, above commands works fine.
Comment Actions Check for examples here: test/Driver/thread-model.c And list of architectures supporting single thread model in https://clang.llvm.org/doxygen/ToolChain_8cpp_source.html#l00729 Comment Actions Fix review comments
../install/bin/clang -Ofast -S -promote-sink-store.ll emit-llvm -target aarch64-linux -mllvm -allow-data-races where IR is obtained from below C testcase - int u, v; void f(int a[restrict], int b[restrict], int n) { for (int i = 0; i < n; ++i) { if (a[i]) { ++u; break; } ++u; if (b[i]) ++v; } } TODO -
Comment Actions Sink store if architecture supports thread model single.
Let me know suggestions on whether to keep (2) approach or not. Comment Actions Now for arm-linux when compiled with -Ofast - int u, v; void f(int a[restrict], int b[restrict], int n) { for (int i = 0; i < n; ++i) { if (a[i]) { ++u; break; } ++u; if (b[i]) ++v; } } we get - target triple = "armv4t-unknown-linux" @u = dso_local local_unnamed_addr global i32 0, align 4 @v = dso_local local_unnamed_addr global i32 0, align 4 ; Function Attrs: nofree norecurse nosync nounwind define dso_local arm_aapcscc void @f(ptr noalias nocapture noundef readonly %0, ptr noalias nocapture noundef readonly %1, i32 noundef %2) local_unnamed_addr #0 { %4 = load i32, ptr @u, align 4, !tbaa !5 %5 = load i32, ptr @v, align 4, !tbaa !5 %6 = icmp sgt i32 %2, 0 br i1 %6, label %7, label %27 7: ; preds = %3 %8 = add i32 %4, %2 br label %9 9: ; preds = %7, %18 %10 = phi i32 [ %25, %18 ], [ 0, %7 ] %11 = phi i32 [ %19, %18 ], [ %4, %7 ] %12 = phi i32 [ %24, %18 ], [ %5, %7 ] %13 = getelementptr inbounds i32, ptr %0, i32 %10 %14 = load i32, ptr %13, align 4, !tbaa !5 %15 = icmp eq i32 %14, 0 br i1 %15, label %18, label %16 16: ; preds = %9 store i32 %12, ptr @v, align 4, !tbaa !5 %17 = add nsw i32 %11, 1 store i32 %17, ptr @u, align 4, !tbaa !5 br label %30 18: ; preds = %9 %19 = add nsw i32 %11, 1 %20 = getelementptr inbounds i32, ptr %1, i32 %10 %21 = load i32, ptr %20, align 4, !tbaa !5 %22 = icmp ne i32 %21, 0 %23 = zext i1 %22 to i32 %24 = add nsw i32 %12, %23 %25 = add nuw nsw i32 %10, 1 %26 = icmp eq i32 %25, %2 br i1 %26, label %27, label %9, !llvm.loop !9 27: ; preds = %18, %3 %28 = phi i32 [ %5, %3 ], [ %24, %18 ] %29 = phi i32 [ %4, %3 ], [ %8, %18 ] store i32 %29, ptr @u, align 4, !tbaa !5 store i32 %28, ptr @v, align 4, !tbaa !5 br label %30 30: ; preds = %27, %16 ret void } Note that - store to u, v both got sinked out of loop. store i32 %29, ptr @u, align 4, !tbaa !5 store i32 %28, ptr @v, align 4, !tbaa !5 Comment Actions What do you think about passing thread model information from clang to LICM? a. I had to change lot many files just to pass a single bool variable - which seems not feasible. Comment Actions The clang changes are sort of right, in the sense that it's getting value of the right flag to the right place, but it's not how we pass around that sort of information. Usually, we do one of two things:
Comment Actions Change thread model single licm flag name to licm-force-thread-model-single from thread-model-single. Comment Actions @efriedma - ThreadModel is available in TargetOptions, how can i make it visible in TargetTransformInfo?
Clang (exist here) -> CodeGen (how to store here) -> Transform(want to use here) In above patch, I passed ThreadModel from Clang to Transform but passing from Clang to CodeGen seems correct. Can you point me to such example?
The second approach does not seem feasible in this case as the information might be lost till the time it reaches LICM. I will go with (1). Comment Actions The ThreadModel is passed from clang->LLVM; see llvm/include/llvm/Target/TargetOptions.h Most interesting parts of TargetTransformInfo are implemented in llvm/include/llvm/CodeGen/BasicTTIImpl.h . In there, you should be able to go from TargetLowering->TargetMachine->TargetOptions, or something like that. Comment Actions Makes sense. One can get TargetOptions using - TargetLowering->getTargetMachine()->DefaultOptions->ThreadModel where DefaultOptions is TargetOptions only if TargetLowering is available in LICM.
-> In LICM, we have TargetTransformInfo and TargetLibraryInfo but again it does not fetch TargetLowering/TargetOptions in any way.
Comment Actions @efriedma - WDYT? If we can get TargetLowering/TargetMachine in LICM then we do not need to pass thread model from clang to LICM. But I do not see any optimization making use of TargetLowering/TargetMachine as it is unavailable. Do you have any suggestions on it? Comment Actions Fix review comments -
Comment Actions I think that instead of "ThreadModel::Model getThreadModel()" as the TTI API, we should just make the API something like "bool isSingleThreaded()". Which is basically, can any other thread run at the "same" time. I don't really want to include TargetOptions.h everywhere, and it makes the reason we're exposing it on TargetTransformInfo a bit more clear. I don't think you ever addressed my comment about constants. The case where that would come up as a practical issue is if you have an argument marked "deferenceable"; we know it's legal to load from an arbitrary dereferenceable pointer, but it's not legal to store to one. We can only store to locations we know are modifiable (allocas, non-constant global variables, etc.) Comment Actions
Comment Actions Understood. Updated as above.
I have added a check to avoid this transformation if store points to constant memory. Let me know WDYT on it. Have I addressed all your comments now? @eli.friedman Comment Actions
You can't use pointsToConstantMemory to prove the property you need. The problem is that it doesn't do what you want if it can't prove anything. You need a function that says the memory isn't writable unless it can prove the memory is writable (maybe call it something like "pointsToWritableMemory"). Getting the proof requirements wrong is a common trap for newcomers to writing compiler optimizations; please watch out for it in the future. Comment Actions In practice, would that be alloca and non-constant globals, or can we handle any other cases as well? I don't think doing this for noalias calls would be legal, because the noalias semantics don't say anything about the memory being writable, right?
Comment Actions Hm ... we already assume that this is fine for noalias calls (in the non-capturing case), so I guess that's a pre-existing issue that is orthogonal to this patch. I believe the right treatment here would be to only omit the isNotCapturedBeforeOrInLoop() check in the single-thread case, but leave everything else alone. Comment Actions Understood. Should I add isNotCapturedBeforeOrInLoop() as an extra check as mentioned by @nikic ?
Noted. Thanks!
Comment Actions First off, please rebase your patch on main; @nikic made changes in rG315aef66. Anyway, with that patch, the relevant bit of code currently looks like this: Value *Object = getUnderlyingObject(SomePtr); SafeToInsertStore = (isNoAliasCall(Object) || isa<AllocaInst>(Object) || (isa<Argument>(Object) && cast<Argument>(Object)->hasByValAttr())) && isNotCapturedBeforeOrInLoop(Object, CurLoop, DT); There are basically two checks here: one, that the location is readable and writable (it's a malloc/alloca/byval argument/etc,), and two, that the location isn't accessed by some other thread (isNotCapturedBeforeOrInLoop). If you know there aren't any other threads, you can just skip the isNotCapturedBeforeOrInLoop() check, and you're basically done. On top of that, you probably also want to expand the list of locations that are known to be readable and writable to include global variables. (The current code doesn't check for them because the isNotCapturedBeforeOrInLoop() check can't succeed for global variables.) Comment Actions I pushed another change that should make this code clearer: https://github.com/llvm/llvm-project/commit/388b684354cc71bd4043ddccbcfd91fb338d8b1e In single-thread mode, the isThreadLocalObject() check can be skipped. isWritableObject() can be extended to handle constant globals. Comment Actions Address review comments. Run -instnamer on lit tests. TODO - Add a negative test for the constant memory case. I am thinking on what changes needs to be done on below testcase to become it as negative test - int u, v; void f(int a[restrict], int b[restrict], int n) { for (int i = 0; i < n; ++i) { if (a[i]) { ++u; break; } ++u; if (b[i]) ++v; } } Comment Actions I had rebased with master while pushing changes. But while pushing, some other patch got merged which causes conflict. Let me rebase it again and push it again. Comment Actions Fix build bot failure caused by change in clang-format version. Combine thread model single check with global variable check.
Comment Actions Move checks (TTI->isSingleThreaded() || SingleThread) inside bool isThreadLocalObject()
Comment Actions I believe the logic is now correct, but testing could be improved. Some suggestions:
Comment Actions
promote-sink-store-arg.ll promote-sink-store-capture.ll promote-sink-store-constant-global.ll promote-sink-store-global.ll
Comment Actions Address all the review comments.
-> promote-sink-store-global.ll
-> promote-sink-store-constant-global.ll
-> promote-sink-store-arg.ll
-> promote-sink-store-capture.ll
Comment Actions Added 2 RUN lines in all the LIT tests.
I have used a simple loop to achieve above.
Done.
Done.
Done. Comment Actions Review request - @momchil.velikov @dmgreen @eli.friedman @nikic @fhahn Can someone give final review? If this patch is fine, is it ready to merge? All review comments have been addressed with no regression. Comment Actions Use below testcase for capture case - test/Transforms/LICM/promote-sink-store-capture.ll void f(int n, int u) { for (int i = 0, x = u; i < n; ++i) { x = i; } } instead of - void f(int a[restrict], int n, int u) { for (int i = 0; i < n; ++i) { int x = u; if (a[i]) { ++x; break; } ++x; } }
Comment Actions Can you please run the test cases through -sroa (as in, commit the test case with SROA applied, not add it to the RUN line)? The current tests contain unnecessary alloca+load/store+lifetime.start/end.
Comment Actions Address review comments -
I am able to run -sroa successfully on promote-sink-store-global.ll I could not run sroa on below LIT tests as everything is getting optimized and there will be no load/store left to showcase this optmization effect - (Please check below godbolt links) a. promote-sink-store-constant-global.ll - Constant global case - https://clang.godbolt.org/z/n5E69ebbz IR obtained from - const int u = 7; int f(int n) { int x, i; for (i = 0; i < n; ++i) { x = u; } return x + u; } b. promote-sink-store-capture.ll - Capture case - https://clang.godbolt.org/z/fjxn7xc46 IR obtained from - void f(int n, int u) { for (int i = 0, x = u; i < n; ++i) { x = u; } } c. promote-sink-store-arg.ll - Function argument case - https://clang.godbolt.org/z/d85qhxbsr IR obtained from - void f(int n, int u) { for (int i = 0; i < n; ++i) { u = i; } }
Comment Actions @nikic - Does the patch look fine now? Is it ready to merge? Let me know suggestions if any. Thanks for your earlier -sroa comment. Comment Actions Should I keep both prefix checks with and without flag? That is increasing number of lines of code in the testcase. After reducing the testcase with bare minumum load/store licm opt - Below are number of lines promote-sink-atomic-store-arg.ll - 30 lines promote-sink-store-arg.ll - 30 lines promote-sink-store-capture.ll - 40 lines promote-sink-store-constant-global.ll - 20 lines promote-sink-store-global.ll - 20 lines promote-sink-store.ll - 35 lines + checks generated by llvm/utils Comment Actions Add simple LIT tests -
Note - There is alloca present in capture case as requested from above comment by @nikic We should have a variant of the test with an alloca that is captured (store promotion legal with the flag, because capture does not impact thread safety). Behaviour of test cases - constant-global-sroa-instnamer.ll capture-instnamer.ll arg-sroa-instnamer.ll promote-sink-store-global.ll
Comment Actions I'm still not really happy with the minimality of these tests. They still contain many unnecessary parts. Generally, unless you are writing debuginfo tests, it is better to write test IR by hand instead of trying to generate it using clang. I've pushed a set of minimal tests at https://github.com/llvm/llvm-project/commit/b6676f3c12588cd1333de9bb3cee3a53bc71771e. Can you please rebase over those tests, and then rerun update_test_checks.py with these adjusted RUN lines? ; RUN: opt -S -licm < %s | FileCheck %s --check-prefixes=CHECK,MT ; RUN: opt -S -licm -licm-force-thread-model-single < %s | FileCheck %s --check-prefixes=CHECK,ST Comment Actions Update RUN lines with flag -licm-force-thread-model-single and rerun update_test_checks.py on promote-single-thread.ll Remove previously submitted LIT tests. Comment Actions Understood!
@nikic - Thanks a lot for above tests. I have rebased and updated it accordingly. Now, store in promote_global() and promote_captured_alloca() gets promoted as per your comments in the testcase. Is this patch good to merge now? Comment Actions Thanks. Can you please commit it for me with below details? I do not have commit access. Name - Shubham Narlawar "Shubham Narlawar <shubham.narlawar@rrlogic.co.in>" Comment Actions They don't have any relation, syncscope is a property of atomic operations. To remove all the LICM-specific aspects, the question here is basically whether it is safe to introduce a spurious (non-atomic) store of the form store (load p), p without introducing a data race. Comment Actions We see that for a single-threaded system or with option '-mllvm -licm-force-thread-model-single', the generated code quality worsens for a very simple example: https://godbolt.org/z/5xhY1EcGb. Would it be possible to fix this? Comment Actions This is not really a regression at the IR level. Ideally the loop would be removed entirely after indvars loop exit replacement, but it currently doesn't support this pattern where the value from the next-to-last iteration is taken. There's a couple open issues for that, but somehow I can't find a single one of them... Comment Actions Ah, I found one of the issues I remembered: https://github.com/llvm/llvm-project/issues/51396 Comment Actions Yes, it looks like the similar conditions are triggered by the code shown in that issue, but perhaps the reason isn't that indvars simplification pass couldn't really deduce the the value from last but one iteration (as seen in the example I shared: https://godbolt.org/z/5xhY1EcGb), but maybe the code structure introduced by this change? The indvarsimplify did work fine in the absence of the option '-licm-force-thread-model-single'. With this option, the store is sunk into exit block in the first pass of LICM itself (just before loop-rotation). This then presents a different loop body to loop-rotation, which doesn't result in a guard condition. In the end, indvarsimplification see the exit block depending upon not just the induction variable value, but also on the original value of 'theGlobalVar', which perhaps make it difficult to reduce this to just an exit val computation. As opposed to this, the case where it works (without the LICM option), the indvar simplification sees the loop as invariant i.e. the exit values just depends upon only the values from within loop, which then perhaps helps in reducing the loop to just an exit value computation. In other words, with this LICM change, the store, which otherwise was getting skipped is now always done, making the code in the exit block to choose between the two values to store, the original value of global or the one less than exit value from the loop. The original value of the global location is propagated through loop making it more complicated than required? So, perhaps this is not required, but instead the code can be transformed such that the choice of values can be directly done in the exit block (e.g. a select between original value or computed exit val based on loop condition, similar to loop-rotation's guard condition)? This will leave the loop only with induction variable computations and thus can still be just reduced to calculation of exit values. |
What should be the return value here? Should it be ThreadModel::INVALID or should I keep it same?
Should I add a new value to existing enum like below?
Above enum is defined in llvm/Target/TargetOptions.h