This un-vectorizes vector loads into scalar loads whenever they are used in a shuffle_vector that is a splat, since only one element of the loaded vector will be accessed.
This opens up better instruction selection for splatted loads and should address a WebAssembly issue: https://github.com/llvm/llvm-project/issues/59120
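For illustration, here is a minimal before/after sketch in LLVM IR of the pattern this combine targets. The function names and types are invented for the example, and the exact form the combine produces is target- and legality-dependent; this is a sketch of the idea, not the patch's literal output.

```
; Before: the whole vector is loaded even though only lane 1 is used.
define <4 x i32> @splat_lane1(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %s = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %s
}

; After (conceptually): only the splatted element is loaded, then broadcast.
define <4 x i32> @splat_lane1_scalarized(ptr %p) {
  %ep = getelementptr inbounds i32, ptr %p, i64 1
  %e = load i32, ptr %ep
  %ins = insertelement <4 x i32> undef, i32 %e, i64 0
  %s = shufflevector <4 x i32> %ins, <4 x i32> undef, <4 x i32> zeroinitializer
  ret <4 x i32> %s
}
```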
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Thanks! It looks like some other X86 tests are still failing; I'll try to address them as well.
On AArch64, some tests in arm64-dup.ll are failing:
```
define <8 x i8> @vduplane8(<8 x i8>* %A) nounwind {
; CHECK-LABEL: vduplane8:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ldr d0, [x0]
; CHECK-NEXT:    dup.8b v0, v0[1]
; CHECK-NEXT:    ret
  %tmp1 = load <8 x i8>, <8 x i8>* %A
  %tmp2 = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  ret <8 x i8> %tmp2
}
```
Since they're now being selected as a broadcasted load:
```
vduplane8:              // @vduplane8
// %bb.0:
  add x8, x0, #1
  ld1r.8b { v0 }, [x8]
  ret
```
I'm not familiar with aarch64, but is it not possible to fold the offset in like this?
```
vduplane8:              // @vduplane8
// %bb.0:
  ld1r.8b { v0 }, [x0], #1
  ret
```
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 138–142 (On Diff #483182): I presume this is a regression, since even though it's loading smaller sizes, it has to do more twiddling.
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
- Lines 1149–1162 (On Diff #483182): These extra lines replace the old P8-AIX prefixed checks that must have been left behind.
llvm/test/CodeGen/X86/half.ll
- Lines 1342–1344 (On Diff #483182): @pengfei This looks like a regression; the scalarized load t18 gets selected as VPINSRWrm:

```
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t17: i64 = add t2, Constant:i64<8>
t18: f16,ch = load<(load (s16) from %ir.p + 8, align 8)> t0, t17, undef:i64
t21: v8f16 = scalar_to_vector t18
t23: v8i16 = bitcast t21
t28: v8i16 = X86ISD::PSHUFLW t23, TargetConstant:i8<0>
t29: v4i32 = bitcast t28
t30: v4i32 = X86ISD::PSHUFD t29, TargetConstant:i8<0>
t36: v8f16 = bitcast t30
t10: ch,glue = CopyToReg t0, Register:v8f16 $xmm0, t36
t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8f16 $xmm0, t10:1
```
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1102–1106 (On Diff #483185): Regression: can the add be folded in as an immediate offset to ld1r.8h { v0 }, [x8]?
llvm/test/CodeGen/X86/half.ll
- Lines 1342–1344 (On Diff #483182): Right. I think this is a special case. We don't have native scalar instructions to load/store half or bfloat on old targets; instead we have to emulate them with the more expensive pinsrw/pextrw, which makes scalar load/store operations suboptimal compared to vector ones.
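For context, this is the kind of IR that hits the issue: an invented minimal example of a scalar half load, which, per the comment above, older x86 targets have to emulate through vector pinsrw/pextrw sequences.

```
define half @load_half_elt(ptr %p) {
  ; hypothetical offset; any scalar half load of this shape is affected
  %q = getelementptr inbounds half, ptr %p, i64 4
  %x = load half, ptr %q
  ret half %x
}
```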
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1102–1106 (On Diff #483185): No, since the offset would actually increment the register operand.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1102–1106 (On Diff #483185): Hello. This can actually be worse, but I don't think that's an issue with this patch. We tend to prefer ld1r over the dup in the mul instruction, but I believe the opposite can be quicker. That is a general issue, though; this patch is just exposing it in a few extra places. None of these tests need to load data, and I think it would be better to remove that part. If they just use vector parameters directly, they will be less susceptible to optimizations on the load altering what the test is intended to check. I can put a patch together to clean this up, and if you rebase on top, most of these changes should hopefully disappear.
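To make that suggestion concrete, here is a hypothetical shape for such a rewritten test (names and types invented here, not taken from arm64-vmul.ll): the vectors are passed as parameters, so there is no load for load-related combines to alter.

```
define <8 x i16> @mul_lane_sketch(<8 x i16> %a, <8 x i16> %b) {
  ; splat lane 1 of %b and multiply, with no memory access involved
  %splat = shufflevector <8 x i16> %b, <8 x i16> undef, <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %r = mul <8 x i16> %a, %splat
  ret <8 x i16> %r
}
```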
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1102–1106 (On Diff #483185): Hopefully if you rebase over rG752819e813d1, most of these changes will go away.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1102–1106 (On Diff #483185): Hello, thanks a million for that patch. Will rebase.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1227–1230 (On Diff #483185): @dmgreen This looks like a regression to me, but I'm not familiar enough with AArch64 to really know for certain. I presume the cost of the additional add instruction outweighs any gains from a smaller load; is that correct? (Hope I'm not bombarding you with too many questions, let me know if there's someone else I can ask!)
Thanks. The remaining Arm and AArch64 test cases look OK to me.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
- Lines 1227–1230 (On Diff #483185): Yeah, it's the same as the other cases. I heard that it can be a little worse than if the dup could be part of the mul/fmulx/etc, but that's a separate issue from this patch. In practice many cases will already be a splat of a scalar value, so will already run into the same issue.
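For reference, "a splat of a scalar value" here means IR along these lines (an invented example), where a loaded scalar is broadcast before the multiply, so the same ld1r-versus-dup trade-off already arises without any vector load being scalarized.

```
define <8 x i16> @mul_splat_scalar(<8 x i16> %a, ptr %p) {
  ; the scalar is loaded and broadcast, then fed to the multiply
  %x = load i16, ptr %p
  %ins = insertelement <8 x i16> undef, i16 %x, i64 0
  %splat = shufflevector <8 x i16> %ins, <8 x i16> undef, <8 x i32> zeroinitializer
  %r = mul <8 x i16> %a, %splat
  ret <8 x i16> %r
}
```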
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): @foad @ruiling Apologies if I'm pinging the wrong people here; I just wanted to get some AMDGPU eyes over this.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): Yes it looks like a regression but I'm not sure how serious it is. The original code did two 4-byte loads even though we only want the upper two bytes of each value. Now we've turned the second one into a 2-byte load that overwrites part of the result of the first load, hence the WAW dependency. Why can't we also turn the first load into a 2-byte load?
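A rough, invented IR approximation of the shape being described here (not the actual test in vector_shuffle.packed.ll): two small vectors are loaded but the shuffle only keeps the high lane of each, which is the part the patch now narrows to smaller loads.

```
define <2 x i16> @high_lanes(ptr %a, ptr %b) {
  ; only lane 1 of each loaded <2 x i16> survives the shuffle
  %va = load <2 x i16>, ptr %a
  %vb = load <2 x i16>, ptr %b
  %s = shufflevector <2 x i16> %va, <2 x i16> %vb, <2 x i32> <i32 1, i32 3>
  ret <2 x i16> %s
}
```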
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): An extra wait usually means a serious regression. But I don't see why we need the s_waitcnt here. The global_loads should return their values in order, so there is no WAW dependency here, right? @foad
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): The waitcnt insertion pass probably doesn't try to understand the tied operands of the d16 loads.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): Right, at the MIR level it looks like a RAW dependency because the d16 load has a tied read representing the parts of the destination register that are not overwritten. So I guess we *could* fix this in the waitcnt insertion pass. (It sounds similar to the special case for writelane in AMDGPUInsertDelayAlu.)
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): D140537 should fix this when it lands.
llvm/test/CodeGen/X86/avx-vbroadcast.ll
- Line 367 (On Diff #487360): ; Pointer adjusted broadcasts
Update AMDGPU tests with D140537
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
- Lines 1905–1914 (On Diff #484287): Thanks!
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
- Line 22841: Done. Is there another optimisation that occurs on a splat with a single non-undef mask element?
llvm/test/CodeGen/WebAssembly/simd-vectorized-load-splat.ll
- Line 6: Please can you pre-commit this test file to trunk with trunk's current codegen and then rebase to show the codegen diffs?
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
- Lines 1149–1162 (On Diff #483182): Sorry, you're right; not sure why I believed that. So if I'm understanding this correctly now, P8-AIX-NEXT checks are generated whenever aix-32 and aix-64 have the same lines.
Similar to D140811: we might want to limit this to cases where the shuffle is not just a single non-undef mask element.
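For illustration, an invented example of the case being raised: when the shuffle mask has only a single non-undef element, the operation is effectively one extract placed into a vector rather than a real splat, so scalarizing the load buys little and may just add address arithmetic.

```
define <4 x i32> @single_defined_lane(ptr %p) {
  ; only lane 0 of the result is defined, and it comes from a single element
  %v = load <4 x i32>, ptr %p
  %s = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  ret <4 x i32> %s
}
```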