- User Since
- Feb 4 2016, 3:03 PM (67 w, 4 d)
Comparing the existing implementation and this patch, I don't observe noticeable compile time different with Video-SIMDBench. I compiled the benchmark suite 3 times each and the median was 27.12 sec vs 27.55 sec, while the average was 27.96sec vs 27.89 sec. There were some code size differences, but it is simply more vectorization results bigger code size. There's no difference between OptForSize and non-vectorized.
Fri, May 19
Addressing Ayal's comments.
@wmi Thank you for your reply. I agree on you that we should consider tied operand group, and as @qcolombet mentioned in the previous comment, if tie operand information is already available before this pass, I'd like to discuss where would be the best place to implement this.
Wed, May 17
@qcolombet Actually you're right. Having tied operands, this can be done within SSA. Do you have a suggestion for a better place to implement this? I thought what isRevCopyChain does is most close to what this patch does.
Tue, May 16
@MatzeB Thank you for your comments. I'm collection compile time numbers now with spec2006. For our internal benchmark, which takes about ~40min to compile, the compile time difference was negligible.
Addressing comments from @MatzeB.
Sat, May 13
@Ayal Thanks for the pointer.
Fri, May 12
Ping. Can someone please let me know what would be the best way to have this patch reviewed? Thanks!
Thu, May 11
I don't have a strong opinion about treating static trip count and profile-based trip count or not. I treated them separately because existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone can explain the reason behind that.
Sun, May 7
Thanks @Ayal for your comments! If the profile-based trip count checking is moved above the line
Addresing comments from Ayal.
Fri, May 5
Fri, Apr 28
Ping. If you're too busy to review this patch, could you please recommend someone else? Thanks!
Thu, Apr 27
So I evaluated loop vectorizer with https://github.com/malvanos/Video-SIMDBench, which is introduced in this paper: http://ieeexplore.ieee.org/document/7723550/. I first built it without profile data and ran linux perf with a command of
Use profile-based trip count estimation only when trip count cannot be computed.
Mon, Apr 24
@mkuper I can do that, but the issue is that the vectorized part doesn't dominated the execution time of the benchmark (profile is pretty flat across multiple functions), so it may not be suitable to tell the impact of vectorization change. Let me think about a good way to measure the it. Thanks!
@mkuper This helps with our internal benchmarks, but I don't have numbers with public benchmarks. What would be the good benchmarks to evaluate the vectorization? Thanks!
Apr 21 2017
@MatzeB I was considering other passes as well for the sake of simplicity, but I concluded two-address instruction is the right place to do. The problem doesn't appear while we're in SSA, but becomes an issue when we make an decision about which operand register should be the destination register. I think this is why isProfitableToCommute function is in two-address instruction pass.
Apr 14 2017
Friendly ping. Thanks!
Apr 12 2017
With numbers, it doesn't seem worth to turn it on by default. @dberlin, just curious, do you have any rough estimation about how much compilation time can be saved by implementing rank-order based iteration optimization?
Below are the numbers with spec2006 (The numbers in the last row are overall for SPEC score and total for compile time and binary size):
Apr 11 2017
@dberlin Thank you for your comments. I agree on you that it is not easy to invent a good heuristics around this knob, but still I find this useful for some cases that actually matters. Assuming that this doesn't make the implementation much complex, wouldn't it be better for compiler users to have more options?
No perf numbers with public benchmarks yet, but with our internal benchmark I confirmed that aggressive load PRE removes redundant instructions in a hot loop. I'll report spec2006 numbers later. This patch is for quick unlocking of the feature before making more fundamental improvement.
Apr 7 2017
small changes to the test.
Mar 14 2017
Mar 13 2017
@aprantl Could you please give your opinion on the inlined questions? Thanks!
Mar 7 2017
Mar 6 2017
@aprantl Thanks for the review! I added inline questions asking for your opinion about consistency vs guideline.
addressing comments from @inglorion
Mar 3 2017
@inglorion That makes a lot of sense. Maybe we don't even need -g, because -S -emit-llvm shows source_filename. I'll run some more experiments with relpath/abspath and debug locations, and submit a revised patch. Thanks for the comment!
Friendly Ping. Thanks!
Mar 2 2017
Rebase. Fix test/CodeGen/X86/and-sink.ll that added 9days ago.
Mar 1 2017
Feb 27 2017
Feb 24 2017
Yes I observed the case from refresh_potential function in spec2006 429.mcf. The original C code snippet is below:
Feb 23 2017
@aprantl I like the idea of having a convenience function. I'm afraid I still don't understand @dblaikie on getMergedLocation is better than the proposed convenience function in preventing buggy profile.
Feb 22 2017
Friendly Ping. Thanks!
Feb 21 2017
Addressing @aprantl's comments. Thanks!
@eric_niebler Do you want any more experiments with this patch? I think Windows machines not printing warnings/fixits for absolute path is a separate issue with this.
Personally I prefer the existing API, because it doesn't allow the transient state before merging. For example, in lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp in this patch, DebugLoc of NewSI is not actually valid after line 1430. Existing API doesn't have this problem.
Feb 20 2017
Now avx512-fsel.ll is auto-generated from the script.
Some tests need to be updated as new DebugLoc changes instruction ordering.
Feb 16 2017
Feb 15 2017
@eric_niebler I just tried it on Windows machine, and it just succeeded with no warnings/fix-it. Is that expected?
Use MachineBasicBlock::findBranchDebugLoc instead of getBranchDebugLoc
Make it explicit that the test doesn't support windows. @eric_niebler, my original intention was avoiding use of platform-dependent path separator, but now made it explicit that the test is not for windows, it should be okay to use '/'. Thanks for the comments!
Feb 14 2017
Thanks for the review!
Oops, I missed it. Thanks!
nit: remove unnecessary #2 at the end of llvm.dbg.value declaration
Addressing @eugenis's comments. Thank you for the suggestion!
Feb 13 2017
@dblaikie Make sense. I addressed it. Thanks!
Friendly ping. Thanks!
Feb 11 2017
Feb 9 2017
Feb 7 2017
Implemented a separate 'findBranchDebugLoc'. This functions does essentially the same job as 'getBranchDebugLoc' function in BranchFolding.cpp. Unlike getBranchDebugLoc, the new findBranchDebugLoc finds all branch instructions inside the function and returns the merged DebugLoc of them. I think it is more conservative, Though I can't imagine the case of two branch instructions in the same machine basic block have different DebugLoc.
- Merge branch 'master' into mbb_metadata
Feb 6 2017
@sebpop Thank you for your review!