- User Since
- Dec 5 2014, 3:20 PM
Tue, Feb 6
Fri, Jan 26
The change looks reasonable to me. Though I wonder if it's possible to trigger similar exponential behavior with GEP instructions - if that's possible, we should add them to VisitedUsers too.
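The blow-up a visited set like VisitedUsers prevents can be sketched generically (the names below are hypothetical illustrations, not the actual LLVM code):

```python
def collect_users(root, get_users):
    """Walk a use graph, visiting each user at most once.

    Without the `visited` set, a chain of N diamond-shaped use
    patterns would be re-traversed on every path, giving 2**N visits;
    with it, each node is processed exactly once.
    """
    visited = set()      # plays the role of VisitedUsers
    worklist = [root]
    order = []
    while worklist:
        node = worklist.pop()
        for user in get_users(node):
            if user in visited:
                continue  # already seen - do not re-traverse
            visited.add(user)
            order.append(user)
            worklist.append(user)
    return order
```

The same guard would cover GEPs simply by recording them in the set before recursing into their users.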
Tue, Jan 23
No concerns from my side, thanks for making the change!
Jan 15 2018
As you said, the pass spends very little time; can't we decide on moving it to -O3 in the future, when/if other heavy optimizations are added to this pass?
And even then, we could decide to run some of them at -O2 and the rest at -O3.
My main concern with that is that it's actually really hard to demote something to lower optlevels retroactively.
Jan 9 2018
Hi and sorry for the late reply, I've just returned from the holidays break.
The numbers posted before look good. I wonder though if it would make sense to only run this pass on -O3. I assume that even if now the pass spends very little time, it will grow in future and the compile-time costs might become noticeable.
Dec 20 2017
Thanks for review!
Phabricator didn't catch the review thread. The patch was committed in rL321236.
Yes, I'm not really worried about that aspect; I'm more concerned that the transform somehow isn't triggering in cases where it should.
I just tried digging a bit; it looks like the generated code is worse, particularly for avoid-cpsr-rmw.ll. I'm not sure that's really the fault of this patch, exactly... it looks like the sinking does in fact happen, but the end result is different because of related transforms. Looks like it's exposing a bad heuristic elsewhere?
Yes, it does behave differently after the patch. However, from the code sinking point of view the new behavior seems better to me.
Dec 19 2017
I don't think SimplifyCFG makes any other changes, but the test isn't for SimplifyCFG - as far as I understand it, it's for instructions we generate in the backend. The test was just written to trigger some specific situation, and with this change it stopped triggering it.
Dec 18 2017
This also fixes a big compile time regression we found that was caused by rL279460 (and subsequent follow-ups). In the problematic test-case we have a huge function, where one of the blocks has more than 1000 predecessors.
Dec 13 2017
Oct 17 2017
Is the patch ok to commit?
Oct 13 2017
Generally, switching from generic containers to LLVM ones leads to improvements, but 17x seems quite large to me (not complaining).
17x probably comes from the test being huge: when we're building with LTO, we're getting a ~50Mb bitcode file.
Oct 4 2017
How do you get the diff out of curiosity?
It's produced by the compare.py script:
python test-suite/utils/compare.py -m compile_time unpatched.json patched.json
python test-suite/utils/compare.py -m size unpatched.json patched.json
But to run it, you will first need to pip install pandas, which can take up to 10 minutes :)
Hm, from the files you uploaded I don't see a slowdown on 7zip, but I do see some others. Are we talking about the same numbers? :)
Oct 3 2017
Which part is cumbersome?
I was talking about compare.py specifically. Namely, it uses the pandas package, which is not as commonly installed as others, and installing it wasn't very easy/fast, if I remember correctly. All I wanted to say is that these tools require some getting used to and might not work on the first try, and in the case of compare.py you might not want to use it if you only need to do this once.
Oct 2 2017
I'd recommend using LNT, here is a script with full ground-up setup for your convenience:
Aug 28 2017
I added a cl::opt...
You seem to have forgotten to use it in the test, though :)
FWIW, this only triggers with SystemZ as triple (probably because it sets peeling parameters differently?).
Yeah, that's probably the reason. I suggest finding and passing the necessary parameters explicitly so we don't depend on target preferences at all.
The change looks good to me, but I'd like to see a more simplified test. We probably don't need 90% of its instructions, and also it doesn't need to be SystemZ-specific.
Aug 22 2017
Aug 10 2017
FWIW, I strongly support the idea of adding more detailed timers into the frontend. Thanks for working on it!
I probably won't be very helpful in reviewing this code, but I'd appreciate if you CC me in the future patches.
Jul 25 2017
Looks good to me. Out of curiosity, does the patch improve compile time on usual tests (rather than on corner cases)?
Jul 12 2017
Does this answer your question?
Yes, thank you for the detailed answer. The patch looks good to me then (please also find some minor comments/suggestions inline).
Don't we break critical edges before unrolling? If so, then I think this situation should never happen (and if it does, we need to understand why).
Jul 6 2017
Do you have any hints for how I might be able to tickle the errors you found?
I found this issue in our regular 2-stage PGO build (the generate-profraw target from tools/clang/utils/perf-training/CMakeLists.txt invokes LIT tests). I think we should just replace all occurrences of lit.util.capture with subprocess.check_output - at least, it shouldn't make things worse.
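A minimal sketch of what that swap might look like (assuming lit.util.capture takes a command list and returns its stdout; the helper name here is just for illustration):

```python
import subprocess

# Before (old lit helper):
#   llvm_src_root = lit.util.capture(['llvm-config', '--src-root']).strip()
# After - subprocess.check_output raises CalledProcessError on a
# non-zero exit code instead of silently returning partial output:
def capture(cmd):
    # universal_newlines=True keeps the return type a str on Python 3
    return subprocess.check_output(cmd, universal_newlines=True)

# llvm_src_root = capture(['llvm-config', '--src-root']).strip()
```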
Jul 5 2017
There are still more uses of lit.util.capture. I fixed one of them in r307201 (it broke PGO builds), but there are still some remaining:
./test/lit.cfg: llvm_src_root = lit.util.capture(['llvm-config', '--src-root']).strip()
./test/lit.cfg: llvm_obj_root = lit.util.capture(['llvm-config', '--obj-root']).strip()
./test/Unit/lit.cfg: llvm_src_root = lit.util.capture(['llvm-config', '--src-root']).strip()
./test/Unit/lit.cfg: llvm_obj_root = lit.util.capture(['llvm-config', '--obj-root']).strip()
./tools/clang/test/lit.cfg: llvm_src_root = lit.util.capture(['llvm-config', '--src-root']).strip()
./tools/clang/test/lit.cfg: llvm_obj_root = lit.util.capture(['llvm-config', '--obj-root']).strip()
./tools/clang/test/Unit/lit.cfg: llvm_src_root = lit.util.capture(['llvm-config', '--src-root']).strip()
./tools/clang/test/Unit/lit.cfg: llvm_obj_root = lit.util.capture(['llvm-config', '--obj-root']).strip()
I wonder why they didn't cause any problems (or why we didn't find the problems). Do we need to change them the same way?
Jun 29 2017
Looks good to me (see one comment below)!
We're seeing more issues (I think they are unrelated to the FileBasedTest issue mentioned before).
Jun 28 2017
Jun 12 2017
I ran SPEC2006 and CTMark with and without the patches and didn't see any performance difference.
Sorry for the delay, I missed this thread somehow.
I'm running tests now and will publish the results as soon as I get them.
Jun 1 2017
I looked at your benchmark numbers, paying particular attention to the worst regressions, and I'm not convinced these benchmarks are representative.
What about tramp3d-v4 and SPECs? As far as I remember, tramp3d-v4 takes quite long to compile and is pretty stable (and it is part of CTMark).
May 23 2017
May 4 2017
May 3 2017
This change is NFC right?
I wanted it to be so, but apparently it's not. See more details in my previous reply.
- Fix indentation.
Thanks for taking a look, I've updated the patch.
- Address Sanjoy's remarks.
Apr 28 2017
Apr 20 2017
Thanks for addressing my previous remarks! The change looks good to me (modulo Eli's remark regarding the check).
Apr 19 2017
Apr 12 2017
The new patch is here, and it's 25% faster than the unpatched version on the testcase included in the PR.
Apr 10 2017
This is precisely equivalent to the stack algorithm I outlined.
I didn't refresh the page before posting, sorry :)
If the slow part is checking whether a block dominates any exit, how about we precompute the set of blocks meeting this requirement in advance?
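One way such a precomputation could look, sketched here generically over an immediate-dominator map rather than the actual LLVM DominatorTree API (function and variable names are hypothetical):

```python
def blocks_dominating_any_exit(exits, idom):
    """Given the immediate-dominator map `idom` (block -> its idom,
    with the entry block mapping to None), return the set of blocks
    that dominate at least one exit block.

    A block dominates an exit iff it lies on the idom chain from that
    exit up to the entry, so one upward walk per exit suffices. After
    this precomputation, each "does B dominate an exit?" query is a
    set lookup instead of a dominator-tree walk.
    """
    dominating = set()
    for block in exits:
        node = block
        while node is not None and node not in dominating:
            dominating.add(node)
            node = idom[node]
    return dominating
```

The early break on `node not in dominating` keeps the total work linear in the number of blocks, since shared idom-chain prefixes are walked only once.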
Apr 5 2017
Can you verify that by inserting an assertion there?
Mar 22 2017
Thanks for posting this, Hal!
Mar 19 2017
Thanks for the review, Sanjoy!
- Address review remarks.
Mar 17 2017
Mar 16 2017
This version looks good to me too.
Mar 15 2017
Do you know if we already have a utility somewhere that detects this?
Can we require the loop to be in a simplified form and its latch to be exiting? Will it suffice?
Looks good to me, but, as you noticed in the PR - do we want to work with such loops at all? Did you check what happens if we just bail out on such loops?
Mar 14 2017
Okay, so is the general plan here to:
- Generally allow unrolling of uncountable loops
- Adjust the profitability heuristic to account for the cost of the branches for the uncountable case
- Include, as necessary, in our general profitability heuristic (if we don't get this naturally), savings from phi-derived values which are cyclic with a small cycle length (like the s = -s case).
I agree that we should start from (1). But I don't understand why (2) (and (3)) are considered special for the uncountable case. AFAIU, the current heuristic already aims at simulating these savings (removed branches) - even though we do not try to accurately predict the cost of the rolled and unrolled versions of the loop. Am I missing something?
Mar 11 2017
Requesting another iteration mostly because I too want to look at it again with fresh eyes to see if I missed anything.
- Add a comment about INT_SMIN.
- Address review remarks.
Mar 10 2017
Uncountable loops where previous value is reused (save one+ instruction for each value)
What do you mean by uncountable? How is it different from a loop with unknown trip count?
What is a previous value? Is it a value from a previous iteration? If that's the case, any induction variable satisfies that condition.