Previously, extra-wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.) Instead, mark v2f32 to v2f64 extending loads as legal and explicitly
lower them using promote_low. This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade-off for now.
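For concreteness, here is a minimal sketch of the kind of input and output involved (the function and the expected instruction sequence are illustrative assumptions, not copied from the patch's tests):

```llvm
; A v2f32 to v2f64 extending load: a vector load feeding an fpext.
define <2 x double> @promote_low_example(ptr %p) {
  %e = load <2 x float>, ptr %p
  %v = fpext <2 x float> %e to <2 x double>
  ret <2 x double> %v
}
; Expected selection after this change (assumption):
;   v128.load64_zero          ;; load the two f32 lanes into the low half
;   f64x2.promote_low_f32x4   ;; widen the low two f32 lanes to f64
```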
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
> Previously, extra-wide v4f32 to v4f64 extending loads would be legalized to v2f32 to v2f64 extending loads, which would then be scalarized by legalization. (v2f32 to v2f64 extending loads not produced by legalization were already being emitted correctly.)

Why are v2f32 to v2f64 extending loads that are not produced by legalization currently handled fine, but not the ones produced by legalization? I guess I lack knowledge of the order of transformations within ISel...
> This regresses the addressing modes supported for the extloads not produced by legalization, but that's a fine trade-off for now.

Can you give an example of this case?
llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:
- Line 1292: Any reason for using unindexedload instead of load?
- Line 1293: Is this necessary? unindexedload already sets IsLoad to true.
llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll:
- Line 30: Can we do a similar optimization for integer vectors, such as extending <4 x i32> to <4 x i64>, using v128.load64_zero followed by i64x2.extend_low_i32x4_s/u (and for other integer types too)? Or is it possible to use v128.load to load everything at once and use i64x2.extend_low_... and i64x2.extend_high_... to extend each part? I'm just asking about the possibility and of course don't mean we should do that in this CL ;)
v2f32 to v2f64 extending loads were previously lowered like this:
    SelectionDAG has 7 nodes:
      t0: ch = EntryToken
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t5: v2f32,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
      t6: v2f64 = fp_extend t5
      t7: ch = WebAssemblyISD::RETURN t0, t6

    ...

    Optimized type-legalized selection DAG: %bb.0 'load_promote_v2f62:'
    SelectionDAG has 15 nodes:
      t0: ch = EntryToken
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t8: i64,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
      t9: v2i64 = scalar_to_vector t8
      t10: v4f32 = bitcast t9
      t12: f32 = extract_vector_elt t10, Constant:i32<0>
      t13: f64 = fp_extend t12
      t15: f32 = extract_vector_elt t10, Constant:i32<1>
      t16: f64 = fp_extend t15
      t17: v2f64 = BUILD_VECTOR t13, t16
      t7: ch = WebAssemblyISD::RETURN t0, t17
but v4f32 to v4f64 extending loads were lowered like this:
    SelectionDAG has 14 nodes:
      t0: ch = EntryToken
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t6: ch = CopyToReg t0, Register:i32 %0, t2
      t7: i32 = Constant<0>
      t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
      t9: v4f32,ch = load<(load (s128) from %ir.p)> t6, t4, undef:i32
      t10: v4f64 = fp_extend t9
      t11: i32,ch = CopyFromReg t0, Register:i32 %0
      t12: ch = store<(store (s256))> t6, t10, t11, undef:i32
      t13: ch = WebAssemblyISD::RETURN t12

    ...

    Optimized type-legalized selection DAG: %bb.0 'load_promote_v4f64:'
    SelectionDAG has 19 nodes:
      t0: ch = EntryToken
      t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t6: ch = CopyToReg t0, Register:i32 %0, t2
      t11: i32,ch = CopyFromReg t0, Register:i32 %0
      t17: v2f64,ch = load<(load (s64) from %ir.p, align 16), anyext from v2f32> t6, t4, undef:i32
      t22: ch = store<(store (s128), align 32)> t6, t17, t11, undef:i32
      t19: i32 = add nuw t4, Constant:i32<8>
      t20: v2f64,ch = load<(load (s64) from %ir.p + 8, basealign 16), anyext from v2f32> t6, t19, undef:i32
      t24: i32 = add nuw t11, Constant:i32<16>
      t25: ch = store<(store (s128) into unknown-address + 16, basealign 32)> t6, t20, t24, undef:i32
      t26: ch = TokenFactor t22, t25
      t13: ch = WebAssemblyISD::RETURN t26
So the original loads get split up in both cases, but the reason is different. For the narrower vectors, the v2f32 is not a legal type so the load gets split by type legalization, but for the wider vectors, it's the v4f64 result of the fp_extend that gets split up with the load in the operand position. I guess the different splitting code paths there lead to completely different DAGs later.
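IR along these lines reproduces the two dumps above (reconstructed from the DAG node names, so the actual test functions may differ slightly):

```llvm
define <2 x double> @load_promote_v2f64(ptr %p) {
  %e = load <2 x float>, ptr %p
  %v = fpext <2 x float> %e to <2 x double>
  ret <2 x double> %v
}

; The v4f64 return is demoted to an sret-style pointer argument, which is
; why the second dump contains stores.
define <4 x double> @load_promote_v4f64(ptr %p) {
  %e = load <4 x float>, ptr %p
  %v = fpext <4 x float> %e to <4 x double>
  ret <4 x double> %v
}
```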
> This regresses the addressing modes supported for the extloads not produced by legalization, but that's a fine trade-off for now.
> Can you give an example of this case?
All of the new load_promote tests in simd-offset.ll have to explicitly calculate the load offset rather than just use the offset immediate on the load. Before this change they would use the offset immediate.
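Roughly, for a load at a constant offset from the pointer argument (a sketch; the instruction sequences are assumptions rather than excerpts from the tests):

```llvm
define <2 x double> @load_promote_with_offset(ptr %p) {
  %q = getelementptr inbounds i8, ptr %p, i32 16
  %e = load <2 x float>, ptr %q
  %v = fpext <2 x float> %e to <2 x double>
  ret <2 x double> %v
}
; Before this change (assumption): the constant folds into the load's
; offset immediate:
;   v128.load64_zero 16
;   f64x2.promote_low_f32x4
; After this change: the address is computed explicitly:
;   i32.const 16
;   i32.add
;   v128.load64_zero 0
;   f64x2.promote_low_f32x4
```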
llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:
- Line 1292: load sets IsNonExtLoad = true, which is not what we want here. Apparently an "unindexed" load is a normal load; the other kinds of loads also perform address computations and return those results in the same instruction.
- Line 1293: I believe so. We are inheriting from PatFrag, not from unindexedload directly, and I think the whole new fragment has to be marked as a load. Looking at the definitions of e.g. extloadf32, though, we could use extload instead of unindexedload and get rid of the IsAnyExtLoad = true line.
llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll:
- Line 30: Yes, there is room to further optimize the integer version of this. Right now <4 x i32> to <4 x i64> extending loads are lowered to a pattern of one v128.load followed by some shuffles and vector shifts. That doesn't seem too bad, but it would be better to use the extend_low and extend_high instructions as you suggest instead. Even here it might be better to do a v128.load followed by a promote_low, shuffle, and second promote_low rather than doing two v128.load64_zero. I'll check with the SIMD experts on that...
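For concreteness, the integer analogue discussed here would look like this in IR (a hypothetical example; the lowerings described in the comments are the current behavior and the suggested alternative from the discussion, not committed changes):

```llvm
define <4 x i64> @extend_v4i32_v4i64(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %e = sext <4 x i32> %v to <4 x i64>
  ret <4 x i64> %e
}
; Current lowering (per the comment above): one v128.load followed by
; shuffles and vector shifts.
; Suggested alternative: one v128.load, then i64x2.extend_low_i32x4_s for
; the low half and i64x2.extend_high_i32x4_s for the high half.
```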
Thanks for the explanation! Then after this CL, does the promote_low part of LowerConvertLow become redundant? Hmm, maybe not completely, because it is also capable of handling shuffles? (By the way, we don't seem to have a test for the shuffled promote_low case; simd-conversions.ll only contains integer vector to v2f64 tests. It would be good to have one for float vectors to v2f64 too.) But given that LowerConvertLow also takes care of integer-to-float conversions, we cannot remove that function anyway. (Unless we do a similar optimization for integer vectors.)
llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:
- Line 1293: That's nice. By the way, extload also sets IsLoad to true, so this line will still be unnecessary.
That lowering is used in cases where there is no load, so I don't think it's redundant. We also already have tests for float to double promote_low and one test that uses the shuffle behavior: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/WebAssembly/simd-conversions.ll#L431-L443
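For example, a promote_low with no load involved still needs that lowering (a sketch modeled on the linked test, not copied from it):

```llvm
define <2 x double> @promote_low_no_load(<4 x float> %v) {
  ; fpext of the low two lanes, with no memory access in sight
  %low = shufflevector <4 x float> %v, <4 x float> poison,
                       <2 x i32> <i32 0, i32 1>
  %p = fpext <2 x float> %low to <2 x double>
  ret <2 x double> %p
}
; Expected to select to a single f64x2.promote_low_f32x4 (assumption).
```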