Differential D67645
[aarch64] add def-pats for dot product
Authored by sebpop on Sep 16 2019, 7:24 PM.

Details

This patch adds the patterns to select the dot product instructions.
Tested on aarch64-linux with make check-all.
Comment Actions
I've got a cheeky request, and I appreciate that it should go in a separate patch, but while you're at it, would you mind repeating this exercise for the ARM backend and AArch32?

Comment Actions
The new patch does not use the first argument of the dot product instruction: we now set it to zero (a rough C model of that accumulator operand follows below).

Comment Actions
Many thanks for addressing this!
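For context on that first argument, here is a rough C model of what SDOT Vd.4S, Vn.16B, Vm.16B computes. The function name is illustrative and not from the patch, but the per-lane arithmetic follows the ARM documentation:

  #include <stdint.h>

  /* Rough C model of SDOT Vd.4S, Vn.16B, Vm.16B: the first operand vd
     is a per-lane accumulator; each 32-bit lane receives the sum of
     four signed byte products. Zeroing vd turns the accumulate into a
     plain dot product. */
  void sdot_model(int32_t vd[4], const int8_t vn[16], const int8_t vm[16]) {
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        vd[i] += (int32_t)vn[4 * i + j] * (int32_t)vm[4 * i + j];
  }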
Comment Actions
I chose 30 as it seemed to work: the rest of the file had complexities <= 20.
I'll let you decide if clang-format is the best at tablegen ;-)

// dot_v4i8
class mul_v4i8<SDPatternOperator ldop>
    : PatFrag<(ops node : $Rn, node : $Rm, node : $offset),
              (mul(ldop(add node : $Rn, node : $offset)),
                  (ldop(add node : $Rm, node : $offset)))>;
class mulz_v4i8<SDPatternOperator ldop>
    : PatFrag<(ops node : $Rn, node : $Rm),
              (mul(ldop node : $Rn), (ldop node : $Rm))>;
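For reference, the scalar shape these fragments are meant to match looks roughly like the following C. This is a sketch only: the full dot_v4i8 pattern in the patch presumably sums four such products, with mulz_v4i8 covering the zero-offset case.

  /* Sketch of the v4i8 source idiom: widening i8 loads from matching
     offsets in both arrays, multiplied pairwise and accumulated. */
  int dot_v4i8_c(const char *a, const char *b, int sum) {
    sum += a[0] * b[0];   /* mulz_v4i8: no offset */
    sum += a[1] * b[1];   /* mul_v4i8 with offset 1 */
    sum += a[2] * b[2];   /* mul_v4i8 with offset 2 */
    sum += a[3] * b[3];   /* mul_v4i8 with offset 3 */
    return sum;
  }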
Sure, there may be some patterns that are not yet covered by the def-pats.
Please suggest better names; I'm fine with changing them.

Comment Actions
To catch more dot product cases, we need to fix the passes above instruction selection. I looked at the basic dot product loop:

int dot_product1(char *a, char *b, int sum) {
  for (int i = 0; i < 16 * K; i += 1)
    sum += a[i] * b[i];
  return sum;
}

for different values of K:
Looks like if we want to catch more dot product patterns, we'll need to fix the SLP and loop vectorizers. I am also looking at some code that comes from TVM, a higher-level compiler that generates LLVM IR.
Comment Actions
Yep, this is actually what I was expecting. That is, I don't see a problem with the pattern matching here if it catches a few cases and helps you. But yes, to do the heavy lifting, this is probably a task for the loop vectorizer.

Comment Actions
There are a few things missing from the current work, such as the indexed dot product, or what the ARM document calls s/udot (vector, by element); no need to do it now, but a comment about that would help. There is also the SVE dot product; we need to port this code to SVE. I agree that this work will miss many opportunities, and that the middle end will optimize the code in such a way that the pattern matching does not fire. I think the dot product needs its own pass or subpass in the middle end. I see three places where it can be done:

1) early, before the vectorizer, in the same way we recognize min/max/etc., or
2) within the vectorizer, as the dot product is mainly a vectorization problem, or
3) after the vectorizer, similar to the handling of SIMD interleaved loads/stores.

The third option, while not the best, has the most chance of being accepted given that it is less disruptive. Any thoughts on this, as we may contribute to this effort in the future?

Comment Actions
You are right, it is not the task of instruction selection to vectorize the code:

Comment Actions
I added a comment in the patch.
Same here, I added a FIXME comment.
I think I like your solution 2, and I think pre- and post-vectorizer passes would work as well. In the SnapDragon compiler we used to generate ARM builtins/intrinsics directly from the vectorizer. Similarly, for the dot product we may want to generate a target-independent builtin.

Comment Actions
Yep, I think we need to generate those reduction intrinsics, which we can then lower/instruction-select. I don't think there's anything controversial about that; intrinsics get generated in a lot of different cases. Are you planning on doing that now and turning your attention to the vectorizer? That would make this work obsolete when it is ready.
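As a sketch of what such a target-independent mul-plus-reduce form can look like at the source level (illustrative only, using clang vector extensions; __builtin_reduce_add lowers to the generic vector-reduce add intrinsic and needs a reasonably recent clang):

  #include <stdint.h>

  typedef int8_t  v16i8  __attribute__((vector_size(16)));
  typedef int32_t v16i32 __attribute__((vector_size(64)));

  /* Widen both i8 vectors to i32 lanes, multiply elementwise, then
     reduce; the reduction intrinsic is what a backend could pattern
     match (or directly lower) to sdot. */
  int32_t dot16(v16i8 a, v16i8 b, int32_t sum) {
    v16i32 wa = __builtin_convertvector(a, v16i32);
    v16i32 wb = __builtin_convertvector(b, v16i32);
    return sum + __builtin_reduce_add(wa * wb);
  }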
Comment Actions
I looked at both the SLP and loop vectorizer, and I think this is more work than I can do right now. Not at this time.
Instruction selection and vectorizers are orthogonal.
Comment Actions
Well, okay, sure... but depending on what tricks the vectoriser does, its output can be different, and thus the input to instruction selection can be different, triggering different instruction selection patterns.
So, if the vectoriser for example emits dot product intrinsics, these patterns won't trigger, and then they are, say, dead code, if you see what I mean; but please correct me if I am wrong. At the same time, as I also said before, if this helps a few cases now, I don't see what's wrong with a nice little bit of pattern matching.

Comment Actions
What if the code is written with intrinsics, but using mul and reduce (say, similar to the last test in this patch; see the sketch below)? Then this patch will optimize that into a dot product instruction. So, for legacy code that was written with the old intrinsics, this patch will remain useful even after the dot product is implemented in the vectorizer. Note that if somebody writes new code with intrinsics using mul and reduce instead of dot product, they are probably doing that for a reason and will want it to stay that way (they can then use the right flags to disable dot product).

Comment Actions
Yes, looks reasonable to me.
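For concreteness, the "mul and reduce" intrinsic idiom mentioned above might look like this classic pre-dotprod NEON sequence (a sketch, not a test taken from the patch):

  #include <arm_neon.h>

  /* Classic pre-dotprod idiom: widening multiply, pairwise accumulate,
     then a cross-lane add. A pattern of the kind added in this patch
     could, in principle, rewrite this shape into a single sdot. */
  int32_t dot16_neon(const int8_t *a, const int8_t *b, int32_t sum) {
    int8x16_t va = vld1q_s8(a);
    int8x16_t vb = vld1q_s8(b);
    int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
    int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    int32x4_t acc = vdupq_n_s32(0);
    acc = vpadalq_s16(acc, lo);
    acc = vpadalq_s16(acc, hi);
    return sum + vaddvq_s32(acc);
  }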
Comment Actions
I think there is an error in this case: as we duplicate the original value $Vo across the 4 lanes of the dot product accumulator, and then at the end do the ADDV reduction across lanes, we end up with 4 times the original value.
I will prepare an updated patch to fix this.
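To see where the factor of 4 comes from, here is a worked sketch with a made-up value, where s stands for the original scalar $Vo and the per-lane dot products are taken as 0 for simplicity:

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
    int32_t s = 7;                   /* original scalar value, i.e. $Vo */
    int32_t lane[4] = {s, s, s, s};  /* buggy: s duplicated per lane */

    /* The final ADDV reduction sums the four lanes... */
    int32_t addv = lane[0] + lane[1] + lane[2] + lane[3];
    assert(addv == 4 * s);           /* ...yielding 4*s instead of s */

    /* Fix: zero the accumulator lanes and add s once, outside the
       reduction (which is what zeroing the first sdot operand enables). */
    int32_t zero[4] = {0, 0, 0, 0};
    int32_t fixed = s + (zero[0] + zero[1] + zero[2] + zero[3]);
    assert(fixed == s);
    return 0;
  }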