When using Clang's __builtin_shufflevector with a 16xi8 or 8xi8 source and runtime mask on an AArch64 target, LLVM currently generates 16 or 8 extract+and+insert operations. This patch replaces these inserts with (a vector AND +) NEON's tbl1 intruction.
Good point. I've added v8i8 too. I only had the v16i8 use case in mind when writing this. This also works for other vector sizes, but that requires some additional logic to multiply and and byte offsets, as TBL only supports v16i8 and v8i8 directly. I've seen some of that around the code and I think x86 instruction selection also does this. If you think this is valuable and the right place to do it, I'd be happy to implement that in a follow-up patch.
I've added a few tests to cover the exits. Please let me know if you think something is missing. I also noticed that some of my assumptions were too pessimistic in the process, so I removed them.
Hi - I like the idea of generating tbl instructions from non-constant masks. Does the patch need an update to be based against main?
This can test against the EVT directly: VT != MVT::v16i8 && .. .
This doesn't need the cast any more.
Should something check that the inner extract element (MaskSource) has an index equal to i?
You can concat with undef if all the needed elements are in the bottom half.
It would be good to have a test case without this And mask, just using %mask directly.
Add a path without AND mask.
We now have a path that does not need an AND on the mask source. We now either expect all elements to have an AND or none.
Also addresses some review comments.
@dmgreen Thanks for the feedback. I've addressed the comments you made. I've renamed the method, as it does not require an AND anymore.
re: update to base against main. I tried to merge the current main branch into another patch (using git) and completely broke my setup, i.e., arc is doing really weird things, trying to lint hundreds of files, etc. So I'm not sure what the correct way is to do this. Do you maybe have a pointer for me? This is my first patch, so I'm new to this setup.
I also had this in mind but was thinking about putting it in another patch. v16i8 and v8i8 directly map, so they are easier to generate. Other sizes have some more overhead, so it makes sense to split them up. I've seen the logic for this in X86 instruction selection, so I guess that could just be reused. I'll do that in a follow-up, once this is merged.
Yes, good catch. I've added a check and a test.
Makes sense. There is actually no reason to limit this to and-masked masks. So I added a path that either expects all elements to have an AND mask or none. If there are no ANDs, we can omit it and directly emit a tbl. I was looking at this from Clang's __builtin_shufflevector() (as this is my use) and that always adds an AND.