Fri, Apr 19
LGTM with a minor fix needed below.
Darwin still uses the old LTO API, which is why the lto Config-based handling in the new LTO API (LTO.h/LTO.cpp) is not being used on this path.
Wed, Apr 17
Right, I should have said something here. Without D60266, there was no change in code size on the set of benchmarks, most likely because loop-rotate is not run and they are not in the required form for unrolling to happen.
I guess it would be sufficient to drop minsize from the attribute list, assuming minsize always comes with optsize?
minsize only comes with optsize when your function is effectively -Oz. If your function is -Os, it will only have optsize.
So, I would keep them both since -Oz behaviour is more important here (IMO). If you want to test both I guess you'll have to duplicate the test, sadly.
Tue, Apr 16
Mon, Apr 15
I haven't been through the details yet, but I have a few high-level thoughts.
Thu, Apr 11
For now I would appreciate any high-level input: is this the right place to do this, or should we have a separate range propagation pass? I think we might be able to use this as a replacement for (parts of) CorrelatedValuePropagation, to avoid a few instances of recursively walking the IR.
I'm not necessarily sure I like this. It's simply spreading in a general framework a concept that shouldn't exist to begin with. I'd rather remove forced constant from SCCP, even though it might result in us losing some optimization power.
Also how does this compare with the existing InductionDescriptor (http://llvm.org/doxygen/classllvm_1_1InductionDescriptor.html)?
I've left a first pass of comments. I would recommend splitting off the parts that are not directly required for getBounds, like getGuard, as those seem orthogonal. It seems there are a lot of checks that ScalarEvolution can already take care of and IMO it would be better to require ScalarEvolution for those things.
Wed, Apr 10
The main problem this patch addresses is the RHS == full-set case, where we exit early.
Thanks for your comments Ayal! I'll update the patch in a bit.
Looks like a straightforward fix, LGTM
Looks like a straightforward fix, LGTM as there is no CMAKE_C_CFLAGS variable.
Tue, Apr 9
Thanks Francesco! I think the test looks good now. If you need someone to commit it, just let me know.
Thu, Apr 4
Ah thanks, I was missing the global nature of physical pointers. I couldn't find this described anywhere (besides some of those things mentioned at a tutorial at EuroLLVM). If this is not described anywhere, do you think it would make sense to add it to the AliasAnalysis documentation page, for example?
Yes, I think we should add this to the AA docs. I think the best reference for a consistent LLVM memory model is https://sf.snu.ac.kr/publications/llvmtwin.pdf .
Here are the times I got for compiling the aggregated bitcode files for clang and a few others. Before is ToT, After is with -forget-scev-loop-unroll=true
File          Before (s)   After (s)
clang-9.bc       7267.91     6639.14
llvm-as.bc        194.12      194.12
llvm-dis.bc        62.50       62.50
opt.bc           1855.85     1857.53
The current test I have is not publishable; I'm still trying to get one.
In the meantime, the above results might motivate having this enabled by default.
I'm currently prototyping a version of DenseMap that uses bit vectors to mark empty/tombstone slots to circumvent the issue. There are some scenarios where this could potentially also improve runtime performance, at the cost of 2 bits per slot. What do you think?
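To make the trade-off concrete, here is a minimal sketch of the general idea, not the prototype itself and not llvm::DenseMap. DenseMap as it exists today marks empty and tombstone slots with two reserved key values; the sketch instead keeps a 2-bit state per slot in a side array, so every key value is storable. The class name `BitStateSet`, the linear probing scheme, the 3/4 load factor and the use of `std::hash` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration: an open-addressing set of ints whose slot state
// lives in a packed side array (2 bits per slot) instead of sentinel keys.
class BitStateSet {
  enum State : unsigned { Empty = 0, Full = 1, Tombstone = 2 };

  std::vector<int> Keys;       // capacity is always a power of two
  std::vector<uint8_t> States; // 4 slots per byte, 2 bits each
  std::size_t NumFull = 0;
  std::size_t NumTombstones = 0;

  static unsigned stateIn(const std::vector<uint8_t> &Bits, std::size_t I) {
    return (Bits[I / 4] >> ((I % 4) * 2)) & 3u;
  }
  void setState(std::size_t I, unsigned S) {
    unsigned Shift = static_cast<unsigned>((I % 4) * 2);
    unsigned Byte = States[I / 4];
    States[I / 4] = static_cast<uint8_t>((Byte & ~(3u << Shift)) | (S << Shift));
  }

  // Returns the slot holding Key if present; otherwise the slot to insert
  // into (the first tombstone seen, or the empty slot that ended the probe).
  std::size_t probe(int Key, bool &Found) const {
    std::size_t Mask = Keys.size() - 1;
    std::size_t I = std::hash<int>{}(Key) & Mask;
    std::size_t FirstTombstone = Keys.size(); // sentinel index: none seen yet
    for (;;) {
      unsigned S = stateIn(States, I);
      if (S == Empty) {
        Found = false;
        return FirstTombstone != Keys.size() ? FirstTombstone : I;
      }
      if (S == Full && Keys[I] == Key) {
        Found = true;
        return I;
      }
      if (S == Tombstone && FirstTombstone == Keys.size())
        FirstTombstone = I;
      I = (I + 1) & Mask;
    }
  }

  // Rebuilds the table at NewCap slots; this also drops all tombstones.
  void rehash(std::size_t NewCap) {
    std::vector<int> OldKeys = std::move(Keys);
    std::vector<uint8_t> OldStates = std::move(States);
    Keys.assign(NewCap, 0);
    States.assign((NewCap + 3) / 4, 0);
    NumFull = NumTombstones = 0;
    for (std::size_t I = 0; I < OldKeys.size(); ++I)
      if (stateIn(OldStates, I) == Full)
        insert(OldKeys[I]);
  }

public:
  BitStateSet() : Keys(8, 0), States(2, 0) {}

  bool insert(int Key) {
    // Keep at least a quarter of the slots empty so probing terminates.
    if ((NumFull + NumTombstones + 1) * 4 > Keys.size() * 3)
      rehash(Keys.size() * 2);
    bool Found;
    std::size_t I = probe(Key, Found);
    if (Found)
      return false;
    if (stateIn(States, I) == Tombstone)
      --NumTombstones;
    Keys[I] = Key;
    setState(I, Full);
    ++NumFull;
    return true;
  }

  bool erase(int Key) {
    bool Found;
    std::size_t I = probe(Key, Found);
    if (!Found)
      return false;
    setState(I, Tombstone);
    --NumFull;
    ++NumTombstones;
    return true;
  }

  bool contains(int Key) const {
    bool Found;
    probe(Key, Found);
    return Found;
  }

  std::size_t size() const { return NumFull; }
};
```

In this sketch the extra memory is exactly the 2 bits per slot mentioned above; whether it helps or hurts runtime would depend on how often the state array and the key array end up in different cache lines, which only measurement can answer.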
Why do we need to rotate the loop before unrolling? llvm::UnrollLoop currently refuses to unroll loops where the latch is an unconditional branch, but that isn't a fundamental limitation, as far as I can tell. We already support unrolling loops where the latch is not the exit branch; allowing loops where the latch doesn't exit at all is a minor extension. Granted, it might be more efficient to explicitly rotate the loop before unrolling, so we don't clone quite so much code.
Fix order whenever it is queried.
LGTM. Just a couple little comments, but none of them are critical:
- In D60266 you have some code size numbers; it would be nice to have something similar here.
Wed, Apr 3
It's not sufficient to check if you can merge two stores into a valid node; there are backends where you need 4 or more to get a legal merged store.
If you look at target-specific implementations of CanMergeStoresTo it essentially serves as a context-specific find maximum store which is what we need here. If you massage that interface a bit you can fold most of this check in there.
Ah thanks, together with @aqjune's response, I think I now know what I was missing. If we have something like

    int8_t* obj1 = malloc(4);
    int8_t* obj2 = malloc(4);
    int p = (intptr_t)(obj1 + 4);
    if (p != (intptr_t) obj2) return;
    *(int8_t*)(intptr_t)(obj1 + 4) = 0; // <- here we alias obj1 and obj2?

I thought the information obtained via the control flow, p aliases both obj1 and obj2, is limited to the uses of p, but do I understand correctly that this is not the case and the information leaks to all equivalent expressions (that is for the snippet above, without GVN or any common code elimination)?
Yes. In the abstract LLVM machine pointers have provenance and integers don't. All integers with the same bitwise value are equivalent (can be replaced one for another), but bitwise identical pointers are not necessarily equivalent. This lets us do aggressive optimization on integers while still keeping a strong (ish) memory model.
A consequence of this is that when you convert (intptr_t)(obj1 + 4) back to a pointer, the new pointer's provenance includes all pointers whose bitwise value could have been obj1 + 4.
Add comment and make even lazier.
Tue, Apr 2
Out of curiosity, would it be possible to provide a .bc file of such a program? Maybe there is some common structure where forgetting everything is faster or some heuristic we can use.
After a bit more benchmarking, I think this patch makes things slightly worse in the general case. I've put up a patch that updates ScheduleDAGRRList to update the topological order on demand, D60125, which gives small, but stable improvements on CTMark. I have to revisit this patch and see how we can deal with extreme cases, without making things worse in the general case.
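As a rough illustration of what "update the topological order on demand" can mean, the sketch below defers all ordering work until the order is actually queried. It is a deliberately simplified stand-in and not the code from D60125: it recomputes the whole order from scratch with Kahn's algorithm rather than updating it incrementally, it assumes the graph stays acyclic, and the names `LazyTopoOrder`, `addEdge`, `getOrder` and `numRecomputes` are hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical illustration: edge insertions only mark the cached order as
// stale; the order is rebuilt the next time somebody asks for it.
class LazyTopoOrder {
  std::vector<std::vector<int>> Succs; // adjacency list, node ids 0..N-1
  std::vector<int> Order;              // cached topological order
  bool Dirty = true;                   // true if Order may be stale
  unsigned Recomputes = 0;             // how often we actually did the work

  // Kahn's algorithm; assumes the graph is acyclic.
  void recompute() {
    std::vector<int> InDegree(Succs.size(), 0);
    for (const std::vector<int> &Out : Succs)
      for (int To : Out)
        ++InDegree[To];

    std::queue<int> Ready;
    for (std::size_t N = 0; N < Succs.size(); ++N)
      if (InDegree[N] == 0)
        Ready.push(static_cast<int>(N));

    Order.clear();
    while (!Ready.empty()) {
      int N = Ready.front();
      Ready.pop();
      Order.push_back(N);
      for (int To : Succs[N])
        if (--InDegree[To] == 0)
          Ready.push(To);
    }
    Dirty = false;
    ++Recomputes;
  }

public:
  explicit LazyTopoOrder(std::size_t NumNodes) : Succs(NumNodes) {}

  // Cheap: no ordering work happens here.
  void addEdge(int From, int To) {
    Succs[From].push_back(To);
    Dirty = true;
  }

  // Pays for the ordering only if something changed since the last query.
  const std::vector<int> &getOrder() {
    if (Dirty)
      recompute();
    return Order;
  }

  unsigned numRecomputes() const { return Recomputes; }
};
```

With this shape, a burst of edge insertions followed by a single query costs one recomputation instead of one per insertion, which is the general benefit being described; the actual patch may achieve it quite differently.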
Mon, Apr 1
For instance:

    // Let's say we know malloc(64) will always return a pointer that is 8 byte
    // aligned.
    int8* ptr0 = malloc(64);
    int8* ptr1 = malloc(64);
    int8* ptr0_end = ptr0 + 64;
    // I'm not sure if this comparison is well defined in C++, but it is well
    // defined in LLVM IR:
    if (ptr0_end != ptr1) return;
    intptr ptr0_end_i = (intptr)ptr0_end;
    intptr ptr0_end_masked = ptr0_end_i & -8;
    // I think the transform being added in this comment will fire below since it is
    // doing inttoptr(and(ptrtoint(ptr0_end), -8)).
    int8* aliases_ptr0_and_ptr1 = (int8*)ptr0_end_masked;

Right now aliases_ptr0_and_ptr1 aliases both ptr0 and ptr1 (we can GEP backwards from it to access ptr0 and forwards from it to access ptr1). But if we replace it with ptr0_end then it can be used to access ptr0 only.
Sun, Mar 31
Thanks for taking a look!
Fri, Mar 29
I am also not entirely sure how control dependencies could add new underlying objects with this patch.
Please read this https://bugs.llvm.org/show_bug.cgi?id=34548
Thu, Mar 28
Yep, I agree that we should keep the stress testing mechanism. It will be very useful to make sure that the construction and predication (and maybe other transformations) are robust enough, since we can run it on loop nests that are not necessarily vectorizable.
Something important, though, is that we shouldn't use this mechanism to bypass legality or pragma simd requirements to vectorize a loop, i.e., we shouldn't use it to generate actual vector code.
What are you trying to achieve, Francesco?
Fix calculation: use signed right shift to divide by 2 with round to -inf and
add 1 for negative results where one value is odd and the other even to get
round to 0 behavior. Proof: https://rise4fun.com/Alive/Plqc
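One plausible reading of this description, assuming the calculation is the average of two signed values (the patched code itself is not shown here), is sketched below. The function names `avgFloor` and `avgTowardZero` are hypothetical, and the sketch assumes `>>` on a negative value is an arithmetic shift, which C++20 guarantees and mainstream compilers already did before that.

```cpp
#include <cstdint>

// Floor of (A + B) / 2 without ever forming A + B, so it cannot overflow.
// Each arithmetic shift rounds its operand's half toward -inf; the
// (A & B & 1) term restores the 1 that is lost when both values are odd.
constexpr int32_t avgFloor(int32_t A, int32_t B) {
  return (A >> 1) + (B >> 1) + (A & B & 1);
}

// Same average, but rounded toward zero like C++ integer division.
// The floor result differs from the truncated one only when the exact
// average is negative and not an integer, i.e. when the floor is negative
// and exactly one of A and B is odd; in that case add 1.
constexpr int32_t avgTowardZero(int32_t A, int32_t B) {
  int32_t Floor = avgFloor(A, B);
  return Floor + ((Floor < 0) & ((A ^ B) & 1));
}
```

For example, with A = -3 and B = 0 the exact average is -1.5: `avgFloor` gives -2, and because the result is negative and the operands have different parity, `avgTowardZero` adjusts it to -1.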
Thanks Francesco. I think we can only remove the arbitrary picking of VF = 4 with VPlanStressTest. I think we still need the other parts for testing, to build VPlans for loops without the pragma, which are then not vectorized.
Wed, Mar 27
@rnk is there any chance you still have the full reproducer? I did not manage to reproduce the long time spent in DSE with the snippet above and I cannot share the reproducer I have unfortunately :(
Try this bitcode I attached in the tracker:
This is the time report I see for it:

    $ opt -time-passes -O2 longbbinit.bc -o NUL
    ===-------------------------------------------------------------------------===
    ... Pass execution timing report ...
    ===-------------------------------------------------------------------------===
    Total Execution Time: 16.4531 seconds (16.4770 wall clock)

    ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
     9.8281 ( 64.8%)   1.2188 ( 95.1%)  11.0469 ( 67.1%)  11.0376 ( 67.0%)  Dead Store Elimination
     1.7344 ( 11.4%)   0.0000 (  0.0%)   1.7344 ( 10.5%)   1.6974 ( 10.3%)  Global Value Numbering
     0.6563 (  4.3%)   0.0156 (  1.2%)   0.6719 (  4.1%)   0.6795 (  4.1%)  Function Integration/Inlining
     0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3846 (  2.3%)  Value Propagation
     0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3743 (  2.3%)  Value Propagation #2
     0.2344 (  1.5%)   0.0156 (  1.2%)   0.2500 (  1.5%)   0.2515 (  1.5%)  Bitcode Writer

I'm not sure it's the exact issue you're looking at here, though.