Repository: rG LLVM Github Monorepo
Event Timeline
Added all the variants of uaddlv, added 9 more unit tests & reduced the code for the pattern match.
Added more patterns to match different types of vector insertion for the same set of 'uaddlv' functions
The code generated for inserting the scalar result of a 'uaddlv' intrinsic into a vector first moves that result to an integer register and then moves it back into the destination Neon vector register; the move could instead be done directly between Neon registers. In my understanding, this happens because the instruction selected for the vector insertion expects an i32/i64 integer register to hold the value, while the instruction selected for 'uaddlv' produces an i32/i64 result at instruction selection time, even though in practice the result of 'uaddlv' lands in a Neon register. I therefore added patterns that match the combined use of these two instructions (uaddlv and vector_insert) and generate a direct move from the source to the destination Neon register.
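As an illustration (the v8i8 source and the register numbers are only assumptions for the example, not taken from actual output), the difference looks roughly like this:

// current codegen: the reduction result is already in a SIMD/FP register,
// but it takes a detour through a GPR to reach the destination lane
uaddlv  h1, v0.8b
fmov    w8, s1
mov     v2.s[0], w8

// with the added patterns: a single direct lane move, no GPR round trip
uaddlv  h1, v0.8b
mov     v2.s[0], v1.s[0]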
I am seeking feedback on whether there exists a better/more generic way of doing this, since currently we are matching patterns case by case. Even though I tried to be exhaustive, it's easy to miss a pattern. Please let me know your thoughts on this.
Tablegen patterns are always an option - there is no problem with adding those. Like you say, the lowering of ADDV reductions is a little odd in places. If you wanted to fix this very generically, it could be written as a post-isel peephole optimization, perhaps in AArch64MIPeepholeOpt. I've not looked deeply into the issues, but it could transform fmov w8, s1; mov.s v2[0], w8 into a lane move instruction, replacing two fairly slow instructions with a single fast one. Happening later, it might help in more cases than just the reductions.
The other option is to do some lowering earlier: split the intrinsic into two operations (the reduction, and the vector extract), and select them separately. We already have AArch64ISD opcodes for other reductions; see combineAcrossLanesIntrinsic in AArch64ISelLowering.cpp. This could unlock additional optimizations. (For example, we could combine the insert with another shuffle.)
Thank you everyone for the feedback. I am currently working on the post-isel peephole optimization since that can capture the more generic pattern of
%y:gpr = COPY %x:fpr
%z:fpr = INSvi..gpr .., a, %y:gpr
& replace it with
%z:fpr = INSvi..lane .., a, %x:fpr, 0
which should apply to a larger class of code than vector insertion of the uaddlv intrinsic result.
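For reference, a rough sketch of what such a peephole could look like in AArch64MIPeepholeOpt follows; the function name, operand indices, and register-class check are assumptions for illustration, not the final code.

// Illustrative sketch only: fold an INSvi*gpr whose GPR operand is a COPY
// from an FPR128 register into the corresponding INSvi*lane, reading lane 0
// of the FPR source directly. Assumed operand layout of INSvi*gpr:
// (def $dst), (tied vector $Rd), (imm $idx), (gpr $Rn).
bool AArch64MIPeepholeOpt::visitINSviGPR(MachineInstr &MI, unsigned LaneOpc) {
  MachineInstr *SrcMI = MRI->getUniqueVRegDef(MI.getOperand(3).getReg());
  if (!SrcMI || SrcMI->getOpcode() != TargetOpcode::COPY)
    return false;

  Register CopySrc = SrcMI->getOperand(1).getReg();
  // Only fold when the COPY reads a virtual FPR128 register.
  if (!CopySrc.isVirtual() ||
      MRI->getRegClass(CopySrc) != &AArch64::FPR128RegClass)
    return false;

  // %z:fpr128 = INSvi*lane %vec(tied), idx, %CopySrc, 0
  BuildMI(*MI.getParent(), MI, MI.getDebugLoc(), TII->get(LaneOpc),
          MI.getOperand(0).getReg())
      .add(MI.getOperand(1))   // vector being inserted into (tied)
      .add(MI.getOperand(2))   // destination element index
      .addReg(CopySrc)         // FPR source, bypassing the GPR copy
      .addImm(0);              // take element 0 of the source
  MI.eraseFromParent();        // any now-dangling COPY is cleaned up later
  return true;
}

A real implementation would also need to dispatch on each element size (INSvi8/16/32/64gpr, with the matching lane opcode) from the pass's main instruction loop.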
I'll update the patch & problem description once I have a fully working solution.
Thanks - this looks like it should be pretty generic.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
- Line 561–565: Will these ever not be true?
- Line 577: Please clang-format. You can also drop the brackets from single-statement ifs.
- Line 595: It may not be Killed if it has other uses.
- Line 600: If this is removing instructions, does it need to check that the COPYs have 1 use?

llvm/test/CodeGen/AArch64/fptosi-sat-vector.ll
- Line 3505: Are you using llvm/utils/update_llc_test_checks.py on the file? I'm not sure the indenting should change.

llvm/test/CodeGen/AArch64/peephole-insvigpr.mir
- Line 112: Some of this can often be removed to make the tests a little simpler.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
- Line 595: If this is not covered by the existing tests, could you add one that covers this case?

llvm/test/CodeGen/AArch64/peephole-insvigpr.mir
- Line 25: You should be able to keep the function here to a minimum, i.e. define void @insert_vec_v6i64_uaddlv_from_v4i32() { ret void }. Then you have to remove the references to LLVM IR in the MIR itself, e.g. bb.0.entry: -> bb.0: and STRDui killed %17, %0, 2 :: (store (s64) into %ir.0 + 16) -> STRDui killed %17, %0, 2 :: (store (s64)).
- Line 86: I think most of those should be removable. It would also be good to add the new test separately.
- Line 92: You should probably also be able to remove this.
Addressed most of the reviewers' comments: removed the kill flag on the source register, dropped the removal of dangling COPY instructions, removed unnecessary checks, and updated the test scripts.
Addressed the reviewer comments.
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
- Line 561–565: I haven't found a case where they aren't true; removed the check in the latest patch.
- Line 577: Done. The check for the source register being virtual and of type FP128 is needed; I removed the rest.
- Line 595: I replaced the 'kill' flag with the original flag of the source register.
- Line 600: Deleted the removal of dangling COPY instructions so that they are handled naturally.

llvm/test/CodeGen/AArch64/fptosi-sat-vector.ll
- Line 3505: Thank you for mentioning this - I wasn't aware of this tool before. Updated the files.
Thanks. From what I can see this LGTM.
Peephole optimizations have a habit of causing subtle problems. Please make sure you run at least a bootstrap to see whether any problems arise.
Thank you for the review. I ran all unit tests, the llvm-lit tests with the llvm-test-suite, and a bootstrap build test.