- User Since
- Jan 26 2016, 2:17 AM (194 w, 6 d)
Thanks, will take that on board, and I will pick this up after the llvm dev conference.
Fri, Oct 18
Thu, Oct 17
Thinking about this some more, I think it would be best to at least check some features of the loop for legality:
- no vector widths greater than 128 bits.
- all vector operations should have the same number of lanes.
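A legality check along those lines could be sketched roughly as follows (plain C++ with a hypothetical `VecOp` struct standing in for the real IR types; not the actual pass code):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a vector operation found in the loop body;
// a real implementation would inspect the IR vector types directly.
struct VecOp {
  unsigned WidthInBits; // total vector width in bits
  unsigned NumLanes;    // number of vector lanes
};

// Returns true only if every vector op is at most 128 bits wide and
// all ops agree on the number of lanes.
static bool isLegalLoop(const std::vector<VecOp> &Ops) {
  if (Ops.empty())
    return true;
  for (const VecOp &Op : Ops)
    if (Op.WidthInBits > 128)
      return false;
  for (const VecOp &Op : Ops)
    if (Op.NumLanes != Ops.front().NumLanes)
      return false;
  return true;
}
```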
Wed, Oct 16
Yep, thanks again!
Thanks for reviewing!
Tue, Oct 15
Thanks for clarifying.
Mon, Oct 14
Just wanted to check if we are happy with this as an initial commit?
Fri, Oct 11
Bit of a drive-by comment, but I can't say I am a big fan of all the string matching on the register names. Not sure if this is a fair comment, because I haven't looked closely at it yet, but could we use the ARM::R[0-9] values more? Perhaps that's difficult from the Clang parts?
Thu, Oct 10
Sure, will do, thanks again for taking a look.
Thanks! Typo fixed.
Yep, looks good to me.
I have committed all my pragma patches, so now back to the last bit: this doc update.
This doc change should now reflect our implementation. Are we happy for this to go in?
Thanks again, looks good.
Wed, Oct 9
Many thanks again for reviewing. I will add the check-not before committing, and as I said, I will follow up soon to address the other improvement opportunities.
Cheers, moved the test back to the LoopVectorize directory where it once was, instead of LoopVectorize/X86 (which was indeed a mistake; I then added the extra run line out of frustration, as the skx core was unknown to me).
Nice one, thanks for fixing! I didn't have the bandwidth to look into this.
Tue, Oct 8
Added two more remarks
Thanks for taking a look!
Mon, Oct 7
Looks like a nice bit of isel to me.
- cleaned up the test case a bit.
- couldn't reuse an existing run-line, I guess because of -mcpu=skx, but a separate run line seems fine to me.
Sat, Oct 5
Thanks for your comments and thoughts.
Fri, Oct 4
The earlier "fix at the core" thought was for LV to do something along these lines:
Apologies for the early ping and for the impatience! But I am just really keen on getting this fixed. :-)
Tue, Oct 1
Hi Ben, this is not really my area of expertise, but it is all starting to make some sense to me. I was expecting this transformation to happen earlier, but this is where the magic happens, so this is probably where it belongs.
Just to make sure I haven't missed anything, I would like to take this patch and run some numbers, for which I need a little bit of time. If in the meantime someone with more experience in this area has a look too, that would be great....
I am getting up to speed with this... and created this reproducer:
Mon, Sep 30
- modified SCEV so that it doesn't add Predicates when OptForSize is enabled. This indeed has the consequence that vectorisation will happen; it will scalarise the stores.
- moved the test case to an existing optsize file.
Fri, Sep 27
Just checking because you mentioned:
Thanks for all the pointers and info; that helped me get back on track.
Hopefully things are actually as simple as the change in this patch.
Hello James! Thanks for informing us. If this does what you say, i.e. it essentially disables global merging with -fdata-sections, then indeed we would get a little bit unhappy about that.
I could run some numbers with this patch, but I can guess what the outcome will be..... I haven't looked at this patch or the PR, but does this need to be default behaviour, or can this e.g. be done under an option?
looks good to me
Thu, Sep 26
The added tests check exactly what's being changed/added here, so that's excellent.
I was just wondering if it would be good if we also added a more end-to-end llc test that shows actual low-overhead loop creation for these cases? That is, if these tests are not already there. I had only a very quick look and am not sure, but you'll probably know.
Wed, Sep 25
Cheers, nice one, LGTM
These all look like very mechanical changes to me. The only thing I am not a big fan of is this UnsupportedFeatures, but I understood that there is not really a better alternative.
Tue, Sep 24
The current implementation of comparing loads is quadratic, yes, but you could use a different algorithm, like splitting loads into a base pointer plus an offset, and constructing a map from base pointers to load offsets.
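A sketch of that idea in plain C++ (the names are illustrative, not from the patch): group loads by base pointer and collect their offsets, so the quadratic all-pairs comparison becomes one linear pass plus per-bucket work.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for a load: a base pointer id plus a byte offset.
// In the real pass, the base and offset would come from decomposing the
// load's address computation.
struct Load {
  unsigned BaseID; // identifies the base pointer
  int64_t Offset;  // constant byte offset from the base
};

// Build a map from base pointer to the offsets loaded from it. Two loads
// only need comparing if they share a base, so pairwise comparison is
// confined to each bucket instead of the whole list.
std::unordered_map<unsigned, std::vector<int64_t>>
groupLoadsByBase(const std::vector<Load> &Loads) {
  std::unordered_map<unsigned, std::vector<int64_t>> Buckets;
  for (const Load &L : Loads)
    Buckets[L.BaseID].push_back(L.Offset);
  return Buckets;
}
```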
I also hadn't thought much about complexity, but indeed, the function RecordMemoryOps, for example, is a bit of an expensive hobby.
Looking at it again, the bookkeeping looks essential; I don't see an easy way to reduce complexity. Delaying it may help a bit, but fundamentally that won't change much, I think.
The usual way to deal with expensive hobbies is to introduce a threshold, and bail if it exceeds that.
Many thanks for reviewing!
Will now look at PR43371
Mon, Sep 23
Added the minsize function attribute, and also added the check for the new vectorization remark, which I had missed porting to the new test file... :-( Thanks for spotting. The fix for the other assert will be better than this...
Sorry, forgot to clean that up; keep the original order.
Another assert which presumably requires similar update is the "Cannot SCEV check stride or overflow when optimizing for size" one.
All comments addressed.
Thanks, looks good to me.
Sun, Sep 22
Restored the assert and expanded it with the !forcedToVectorize check
Sep 21 2019
Ah, I had completely missed that this is about the interaction of OptForSize and the loop.vectorize loop hint!
Vectorization is forced, which means that we vectorize even though OptForSize is set and we need to emit runtime checks.
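For context, the kind of source that triggers this scenario is a loop hint forcing vectorization even when compiling for size (illustrative C, not taken from the report; `sum` is a made-up example function):

```c
// Compiled with e.g. -Oz, the vectorizer would normally skip this loop
// because the unknown trip count 'n' requires runtime checks under
// OptForSize; the loop hint forces vectorization anyway.
int sum(const int *a, int n) {
  int s = 0;
#pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```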
Which commit does this trace back to, D65197?
Thanks. The runtime check is now done earlier.
Sep 20 2019
So the pass shouldn't attempt to convert a loop that contains any vector intrinsics, other than masked.load and masked.store. So, we will actually need to do extra work for this to operate with MVE intrinsics. Even once we've done that, the pass only tries to remove the old icmp predicates, which we pattern match. If the user defined predicates match those that the vectorizer outputs, then there's no reason why we won't perform this transform. I would say that I'd add a test, but we don't have intrinsic support yet...
Just curious, if the input is already a tail-predicated loop (e.g. coming from intrinsics), would it be possible we incorrectly start removing/inserting things?
Sep 19 2019
Yes, looks reasonable to me.
Perhaps wait a day with committing just in case someone has concerns.
Instruction selection and vectorizers are orthogonal.
Sep 18 2019
Yep, I think we need to generate those reduction intrinsics, which we can then lower/instruction-select. I don't think there's anything controversial about that; intrinsics get generated in a lot of different cases.
To catch more dot product cases, we need to fix the passes above instruction selection.
ok, sounds all good.
Many thanks! I've passed on the message, hopefully we can do something about this.
Sep 17 2019
Many thanks for addressing this!
This pattern matching looks very reasonable to me, just a round of nits:
- was just curious about the AddedComplexity = 30
- perhaps I am wrong, but was wondering if clang-format agrees with some of the indentation,
- would some more tests bring any benefits/additional value? Like tests with loops, or tests with no aliasing attributes?
- please ignore if you disagree, but thought that some names like ldop, K1, K2 could be slightly more informative/consistent.
I've got a cheeky request, and I appreciate that it should go in a separate patch, but while you're at it, would you mind repeating this exercise for the ARM backend and AArch32?
Just uploading a new diff for completeness; I only had to change a test case, and thus thought that committing this is okay.
Sep 16 2019
We're not generating any TP loops yet, so we're okay!
They are different things. I actually don't think that the specification prevents any instructions from being in a TP loop, we just need to be careful of these. We should be able to allow most (all?) of these once we implement some better checks in the backend to look at the operands.