This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Scalarize vectorized loads that are splatted
Abandoned · Public

Authored by luke on Dec 14 2022, 4:47 PM.

Details

Summary

This un-vectorizes vector loads into scalar loads whenever they are used in a shuffle_vector that is a splat, since only one element of the loaded vector will be accessed.
This opens up better instruction selection of splatted loads and should address a WebAssembly issue: https://github.com/llvm/llvm-project/issues/59120
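
For illustration, the general shape of such a combine might look roughly like the sketch below. This is not the actual patch (which is not reproduced in this archive); the function name and surrounding structure are assumptions, and only commonly used SelectionDAG APIs are shown.

// Rough sketch only; the real patch differs in detail and handles more cases
// (see the review comments below on operand numbering, mask restrictions and
// type legality).
SDValue scalarizeSplattedLoad(ShuffleVectorSDNode *Shuf, SelectionDAG &DAG) {
  if (!Shuf->isSplat())
    return SDValue();

  EVT VT = Shuf->getValueType(0);
  int SplatIdx = Shuf->getSplatIndex();

  // For simplicity this sketch only handles splats drawn from the first
  // shuffle operand; see the operand-numbering note later in the review.
  if (SplatIdx >= (int)VT.getVectorNumElements())
    return SDValue();

  SDValue Src = Shuf->getOperand(0);
  auto *Ld = dyn_cast<LoadSDNode>(Src);
  if (!Ld || !ISD::isNormalLoad(Ld) || !Src.hasOneUse())
    return SDValue();

  // Load only the element that is actually used, then splat the scalar.
  SDLoc DL(Shuf);
  EVT SVT = VT.getVectorElementType();
  uint64_t PtrOff = SVT.getStoreSize().getFixedValue() * SplatIdx;
  SDValue Ptr = DAG.getMemBasePlusOffset(Ld->getBasePtr(),
                                         TypeSize::getFixed(PtrOff), DL);
  SDValue ScalarLd =
      DAG.getLoad(SVT, DL, Ld->getChain(), Ptr,
                  Ld->getPointerInfo().getWithOffset(PtrOff));
  // A real implementation also has to keep chain users of the original load
  // correctly ordered (e.g. via makeEquivalentMemoryOrdering).
  return DAG.getSplatBuildVector(VT, DL, ScalarLd);
}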

Diff Detail

Event Timeline

luke created this revision. Dec 14 2022, 4:47 PM
Herald added a project: Restricted Project. Dec 14 2022, 4:47 PM
luke requested review of this revision. Dec 14 2022, 4:47 PM
luke updated this revision to Diff 483036. Dec 14 2022, 4:50 PM

Fix commit message

Checked changes in X86 tests are all correct.

luke added a comment (edited). Dec 15 2022, 3:41 AM

Checked changes in X86 tests are all correct.

Thanks! It looks like some other X86 tests are still failing; I'll try to address them as well.

On AArch64, some tests in arm64-dup.ll are failing:

define <8 x i8> @vduplane8(<8 x i8>* %A) nounwind {
; CHECK-LABEL: vduplane8:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ldr d0, [x0]
; CHECK-NEXT:    dup.8b v0, v0[1]
; CHECK-NEXT:    ret
	%tmp1 = load <8 x i8>, <8 x i8>* %A
	%tmp2 = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> < i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1 >
	ret <8 x i8> %tmp2
}

Since they're now being selected as a broadcasted load:

vduplane8:                              // @vduplane8
// %bb.0:
	add	x8, x0, #1
	ld1r.8b	{ v0 }, [x8]
	ret

I'm not familiar with aarch64, but is it not possible to fold the offset in like this?

vduplane8:                              // @vduplane8
// %bb.0:
	ld1r.8b	{ v0 }, [x0], #1
	ret
luke updated this revision to Diff 483182. Dec 15 2022, 7:27 AM

Update tests

luke added inline comments. Dec 15 2022, 7:35 AM
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
138–142 ↗(On Diff #483182)

I presume this is a regression, since even though it's loading smaller sizes, it has to do more twiddling.

llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
1149–1162 ↗(On Diff #483182)

These extra lines replace the old P8-AIX-prefixed checks, which must have been left behind.

llvm/test/CodeGen/X86/half.ll
1342–1344 ↗(On Diff #483182)

@pengfei This looks like a regression: the scalarized load t18 gets selected as a VPINSRWrm.

t0: ch,glue = EntryToken
                  t2: i64,ch = CopyFromReg t0, Register:i64 %0
                t17: i64 = add t2, Constant:i64<8>
              t18: f16,ch = load<(load (s16) from %ir.p + 8, align 8)> t0, t17, undef:i64
            t21: v8f16 = scalar_to_vector t18
          t23: v8i16 = bitcast t21
        t28: v8i16 = X86ISD::PSHUFLW t23, TargetConstant:i8<0>
      t29: v4i32 = bitcast t28
    t30: v4i32 = X86ISD::PSHUFD t29, TargetConstant:i8<0>
  t36: v8f16 = bitcast t30
t10: ch,glue = CopyToReg t0, Register:v8f16 $xmm0, t36
t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8f16 $xmm0, t10:1
luke updated this revision to Diff 483185. Dec 15 2022, 7:38 AM

Update aarch64 tests

luke added inline comments. Dec 15 2022, 7:39 AM
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1102–1106 ↗(On Diff #483185)

Regression: Can the add be folded in as an immediate offset to ld1r.8h { v0 }, [x8]?
The same applies to the cases below.

luke marked 3 inline comments as not done. Dec 15 2022, 7:42 AM
pengfei added inline comments. Dec 15 2022, 9:33 PM
llvm/test/CodeGen/X86/half.ll
1342–1344 ↗(On Diff #483182)

Right. I think this is a special case. We don't have native scalar instructions to load/store half or bfloat on old targets; instead, we have to emulate them with the more expensive pinsrw/pextrw, which makes scalar load/store operations suboptimal compared to vector ones.
Have you noticed whether other targets have a similar problem? It would be better if we can find a way to avoid the regression; otherwise, I think we can add a FIXME for the moment.

luke added inline comments.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1102–1106 ↗(On Diff #483185)

No, since the offset would actually increment the register operand.
Should we instead check whether the target is able to perform an indexed load, and otherwise bail when the offset != 0?
There's a target lowering hook called isIndexingLegal that seems like it could be used to check this, but no targets currently implement it, and it was added for GlobalISel: https://reviews.llvm.org/D66287
@t.p.northover would you have any thoughts on this?
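
For reference, the most conservative guard along these lines would be to only scalarize when the splatted element sits at offset zero, so no address arithmetic is introduced at all. A sketch, reusing SplatIdx from the sketch in the Summary above (this is just one possible option, not what the patch does):

// Conservative option: only scalarize the load when the splatted element is
// the first one, so the scalar load reuses the original base pointer and no
// extra 'add' is needed in front of ld1r-style instructions.
// A finer-grained check would need a target hook; as noted above,
// isIndexingLegal is currently GlobalISel-only.
if (SplatIdx != 0)
  return SDValue();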

dmgreen added inline comments.
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1102–1106 ↗(On Diff #483185)

Hello. This can actually be worse, but I don't think that's an issue with this patch. We tend to prefer ld1r over the dup in the mul instruction, but I believe the opposite can be quicker. That is a general issue though; this patch is just exposing it in a few extra places.

None of these tests need to load data, and I think it would be better to remove that part. If they just use vector parameters directly, they will be less susceptible to optimizations on the load altering what the test is intended to check. I can put a patch together to clean this up, and if you rebase on top, most of these changes should hopefully disappear.

dmgreen added inline comments. Dec 20 2022, 8:21 AM
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1102–1106 ↗(On Diff #483185)

Hopefully if you rebase over rG752819e813d1, most of these changes will go away.

luke added inline comments. Dec 20 2022, 8:22 AM
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1102–1106 ↗(On Diff #483185)

Hello, thanks a million for that patch. Will rebase.

luke added inline comments. Dec 20 2022, 8:26 AM
llvm/test/CodeGen/AArch64/arm64-vmul.ll
1227–1230 ↗(On Diff #483185)

@dmgreen This looks like a regression to me, but I'm not familiar enough with aarch64 to really know for certain. I presume the cost of the additional add instruction outweighs any gains from a smaller load; is that correct?

(Hope I'm not bombarding you with too many questions, let me know if there's someone else I can ask!)

Thanks. The remaining Arm and AArch64 test cases look OK to me.

llvm/test/CodeGen/AArch64/arm64-vmul.ll
1227–1230 ↗(On Diff #483185)

Yeah it's the same as the other cases. I heard that it can be a little worse than if the dup could be part of the mul/fmulx/etc, but that's a separate issue from this patch. In practice many cases will already be a splat of a scalar value, so will already run into the same issue.

luke added inline comments.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

@foad @ruiling Apologies if I'm pinging the wrong people here; I just wanted to get some AMDGPU eyes on this.
From what I understand, this looks like a regression since the two loads aren't dispatched in tandem anymore; there are separate waits. Are there any suggestions on how to avoid this, or any target info hooks that might be relevant here?

foad added inline comments.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

Yes it looks like a regression but I'm not sure how serious it is.

The original code did two 4-byte loads even though we only want the upper two bytes of each value. Now we've turned the second one into a 2-byte load that overwrites part of the result of the first load, hence the WAW dependency. Why can't we also turn the first load into a 2-byte load?

Also @rampitec @arsenm

ruiling added inline comments. Dec 21 2022, 4:34 AM
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

An extra wait usually means a serious regression, but I don't see why we need the s_waitcnt here. The global_loads should return their values in order, so there is no WAW dependency here, right? @foad

arsenm added inline comments. Dec 21 2022, 5:01 AM
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

The waitcnt insertion pass probably doesn't try to understand the tied operands of the d16 loads

foad added inline comments. Dec 22 2022, 12:06 AM
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

Right, at the MIR level it looks like a RAW dependency because the d16 load has a tied read representing the parts of the destination register that are not overwritten. So I guess we *could* fix this in the waitcnt insertion pass. (It sounds similar to the special case for writelane in AMDGPUInsertDelayAlu.)

RKSimon added inline comments. Dec 22 2022, 2:10 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
22844

Cleaner to use Shuf->getValueType(0) - result/operands will have matching types in DAG.

22849

Please check the operand numbers -
LHS (Op0) = 0 <= Idx < NumElts
RHS (Op1) = NumElts <= Idx < 2*NumElts
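
For reference, the split being described would look something like this (a sketch using the names from the sketch in the Summary above, not the patch's actual code):

// Splat indices in [0, NumElts) pick an element of the LHS (operand 0);
// indices in [NumElts, 2*NumElts) pick an element of the RHS (operand 1),
// with the in-vector element index taken modulo NumElts.
unsigned NumElts = Shuf->getValueType(0).getVectorNumElements();
int SplatIdx = Shuf->getSplatIndex();
SDValue Src = Shuf->getOperand(SplatIdx < (int)NumElts ? 0 : 1);
unsigned EltIdx = SplatIdx % NumElts;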

22860

Pull out repeated Shuf->getSplatIndex() calls

luke updated this revision to Diff 487360. Jan 9 2023, 2:54 AM

Address review comments

luke marked 3 inline comments as done. Jan 9 2023, 2:55 AM
foad added inline comments. Jan 9 2023, 3:01 AM
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

D140537 should fix this when it lands.

RKSimon added inline comments. Jan 9 2023, 3:01 AM
llvm/test/CodeGen/X86/avx-vbroadcast.ll
367 ↗(On Diff #487360)

; Pointer adjusted broadcasts

luke updated this revision to Diff 488557. Jan 12 2023, 2:31 AM

Rebase and update comments

luke updated this revision to Diff 488569. Jan 12 2023, 2:56 AM

Update AMDGPU tests with D140537

llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll
1905–1914 ↗(On Diff #484287)

Thanks!

RKSimon added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
22841

Similar to D140811 - we might want to limit this to cases where the shuffle is not just a single non-undef mask element
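
Such a restriction might look roughly like the following (a sketch against the sketch in the Summary above, not the patch's exact code):

// Skip splats where only a single mask element is non-undef; those cases are
// better handled by the existing extract-element style combines.
ArrayRef<int> Mask = Shuf->getMask();
if (llvm::count_if(Mask, [](int M) { return M >= 0; }) <= 1)
  return SDValue();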

luke updated this revision to Diff 488953. Jan 13 2023, 4:53 AM

Limit to cases where the shuffle is not just a single non-undef mask element

luke marked an inline comment as done. Jan 13 2023, 4:57 AM
luke added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
22841

Done: Is there another optimisation that occurs on a splat w/ a single non-undef mask element?

luke updated this revision to Diff 488962. Jan 13 2023, 5:15 AM
luke marked an inline comment as done.

Update affected tests

RKSimon added inline comments. Jan 14 2023, 2:07 PM
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
1149–1162 ↗(On Diff #483182)

Aren't the P8-AIX/P8-AIX-32/P8-AIX-64 checks still used? There's more likely an issue with aix-32 having a legal <2 x i64> type but not the i64 type.

llvm/test/CodeGen/WebAssembly/simd-vectorized-load-splat.ll
6

Please pre-commit this test file to trunk with trunk's current codegen, and then rebase to show the codegen diffs.

luke updated this revision to Diff 489477. Jan 16 2023, 2:44 AM

Split out webassembly test into pre-commit test

luke marked 6 inline comments as done. Jan 16 2023, 4:23 AM
luke added inline comments. Jan 16 2023, 4:30 AM
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
1149–1162 ↗(On Diff #483182)

Sorry, you're right; I'm not sure why I believed that. So if I'm understanding this correctly now, P8-AIX-NEXT checks are generated whenever aix-32 and aix-64 produce the same lines.
Will take a look

luke updated this revision to Diff 489506. Jan 16 2023, 4:48 AM

Only scalarize if scalar type is legal
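
Presumably this amounts to a guard along the following lines (a sketch; VT and TLI as in the sketch in the Summary above, with TLI being the usual TargetLowering reference available in DAGCombiner):

// Bail out when the scalar element type is not legal on the target, e.g.
// 32-bit AIX, where <2 x i64> is a legal type but i64 is not.
EVT SVT = VT.getVectorElementType();
if (!TLI.isTypeLegal(SVT))
  return SDValue();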

luke abandoned this revision. Apr 3 2023, 4:35 AM