If we are only extracting vector elements via EXTRACT_VECTOR_ELT(s) we may be able to use SimplifyDemandedVectorElts to avoid unnecessary vector ops.
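For reference, a minimal sketch of the shape this combine might take inside DAGCombiner::visitEXTRACT_VECTOR_ELT (a simplified fragment, not the verbatim patch; TLI and DCI are the usual DAGCombiner plumbing): if we are extracting a single known lane, only that lane is demanded.

  // Sketch: an extract of one known lane demands exactly one element.
  SDValue VecOp = N->getOperand(0);
  SDValue Index = N->getOperand(1);
  unsigned NumElts = VecOp.getValueType().getVectorNumElements();

  if (auto *CIdx = dyn_cast<ConstantSDNode>(Index)) {
    if (CIdx->getAPIntValue().ult(NumElts)) {
      APInt DemandedElts = APInt::getOneBitSet(NumElts, CIdx->getZExtValue());
      // Let the demanded-elts logic rewrite VecOp when the other
      // lanes turn out not to matter.
      if (TLI.SimplifyDemandedVectorElts(VecOp, DemandedElts, DCI))
        return SDValue(N, 0);
    }
  }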
test/CodeGen/AArch64/aarch64-be-bv.ll:33
Am I missing something? Why the extractelement? Why not return the <8 x i16> add result directly?
test/CodeGen/AArch64/aarch64-be-bv.ll:33
Returning the result "directly" involves a bitcast, which is also likely to break in the future (this is big-endian, so it swaps the elements). Maybe store the result to memory instead.
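To spell out the big-endian hazard for anyone following along: a vector bitcast reinterprets the underlying bytes, not the lanes, so a cast that is a lane-preserving no-op on little-endian pairs the lanes up differently on big-endian. A tiny host-side analogy (plain C++, illustrative only, not LLVM code):

  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  // Host-side analogy for a vector bitcast: the bytes stay put, only the
  // interpretation changes, so lane pairing depends on endianness.
  int main() {
    uint16_t v16[4] = {1, 2, 3, 4};     // think <4 x i16>
    uint32_t v32[2];                    // think <2 x i32>
    std::memcpy(v32, v16, sizeof(v32)); // the "bitcast": same bytes, new type
    // Little-endian prints 0x00020001 0x00040003;
    // big-endian prints 0x00010002 0x00030004.
    std::printf("0x%08x 0x%08x\n", v32[0], v32[1]);
  }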
You need to get reviewers for the test changes to AMDGPU and SystemZ. Otherwise LGTM.
test/CodeGen/X86/oddshuffles.ll:366
It looks like the total instruction count is increasing here? Maybe an issue with x86 shuffle lowering?
Hmm... the SystemZ tests seem to be getting strictly worse. Before, we have in f3:
  vaf   %v0, %v24, %v26
  vlgvh %r0, %v0, 6
  vlgvh %r2, %v28, 3
  ar    %r2, %r0
and after the patch you're testing for:
  vaf   %v0, %v24, %v26
  vrepf %v0, %v0, 3
  vlgvh %r0, %v0, 2
  vlgvh %r2, %v28, 3
  ar    %r2, %r0
(And similar for f4.)
Given that the point of this test is to ensure that there is no superfluous vrep, this seems like a clear regression. Can you check what's going on here?
Updated with a SystemZ fix to permit permute decoding of target shuffles (well, SPLAT) as well - @uweigand, does that look OK to you?
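For context, the fix boils down to letting the shuffle-mask decoding see through a target splat: a splat of source lane SplatIdx is just a permute whose mask repeats that index. A hypothetical helper (the name and signature are illustrative, not the actual SystemZ code):

  #include "llvm/ADT/SmallVector.h"

  // Hypothetical decode: expand a splat of source lane SplatIdx into an
  // explicit permute mask that demanded-elements analysis can reason about.
  static void decodeSplatAsPermute(unsigned NumElts, unsigned SplatIdx,
                                   llvm::SmallVectorImpl<int> &Mask) {
    Mask.assign(NumElts, static_cast<int>(SplatIdx));
  }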
I've generalized DAGCombiner::visitEXTRACT_VECTOR_ELT to handle the case where the source vector has multiple uses: if all of them are EXTRACT_VECTOR_ELT, we now accumulate the demanded elements mask accordingly. This simplifies some MIPS vector codegen, so I'm adding @atanasyan to take a look.
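Roughly, the generalization looks like this (again a sketch, not the verbatim patch): walk every use of the source vector, bail out if any use is not an in-range constant-index EXTRACT_VECTOR_ELT, and otherwise OR the extracted lanes into a single demanded mask.

  // Sketch: all uses are known-lane extracts, so demand only those lanes.
  SDValue VecOp = N->getOperand(0);
  unsigned NumElts = VecOp.getValueType().getVectorNumElements();
  APInt DemandedElts(NumElts, 0);

  for (SDNode *Use : VecOp->uses()) {
    if (Use->getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
        Use->getOperand(0) != VecOp)
      return SDValue(); // some other use - don't narrow the vector
    auto *CIdx = dyn_cast<ConstantSDNode>(Use->getOperand(1));
    if (!CIdx || CIdx->getAPIntValue().uge(NumElts))
      return SDValue(); // variable or out-of-range index
    DemandedElts.setBit(CIdx->getZExtValue());
  }

  if (TLI.SimplifyDemandedVectorElts(VecOp, DemandedElts, DCI))
    return SDValue(N, 0);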
This is nice, but it's destroying the intent of the test, which is to check that we generate the correct movi instruction.