This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Lower v2f32 to v2f64 extending loads with promote_low
ClosedPublic

Authored by tlively on Aug 20 2021, 4:36 PM.

Details

Summary

Previously, extra-wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.) Instead, mark v2f32 to v2f64 extending loads as legal and explicitly
lower them using promote_low. This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade-off for now.
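
For reference, a minimal IR sketch of the basic (non-legalization-produced) pattern, with the rough lowering expected after this change; the function and value names are illustrative, not taken from the tests:

define <2 x double> @promote_low_of_load(<2 x float>* %p) {
  %v = load <2 x float>, <2 x float>* %p
  %e = fpext <2 x float> %v to <2 x double>
  ret <2 x double> %e
}
; Expected selection (roughly): a single 64-bit load into the low lanes,
; followed by the promote instruction:
;   v128.load64_zero
;   f64x2.promote_low_f32x4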

Diff Detail

Event Timeline

tlively created this revision. Aug 20 2021, 4:36 PM
tlively requested review of this revision. Aug 20 2021, 4:36 PM
Herald added a project: Restricted Project. Aug 20 2021, 4:36 PM
aheejin added a comment (edited). Aug 21 2021, 9:40 PM

Previously, extra-wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.)

Why are v2f32 to v2f64 extending loads that are not produced by legalization currently handled fine, but not the ones produced by legalization? I guess I lack knowledge of the order of transformations within isel..

This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade-off for now.

Can you give an example of this case?

llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td
1292

Any reason for using unindexedload instead of load?

1293

Is this necessary? unindexedload already sets IsLoad to true.

llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll
10–12
30

Can we do a similar optimization for integer vectors, such as extending <4 x i32> to <4 x i64>, using v128.load64_zero followed by i64x2.extend_low_i32x4_s/u (and for other integer types too)? Or is it possible to use v128.load to load everything at once and use i64x2.extend_low_... and i64x2.extend_high_... to extend each part?

I'm just asking about the possibility and of course don't mean we should do that in this CL ;)

llvm/test/CodeGen/WebAssembly/simd-offset.ll
2960–2962
tlively marked an inline comment as done. Aug 25 2021, 4:43 PM

Previously, extra-wide v4f32 to v4f64 extending loads would be legalized to v2f32
to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
to v2f64 extending loads not produced by legalization were already being emitted
correctly.)

Why are v2f32 to v2f64 extending loads that are not produced by legalization currently handled fine, but not the ones produced by legalization? I guess I lack knowledge of the order of transformations within isel..

v2f32 to v2f64 extending loads were previously lowered like this:

SelectionDAG has 7 nodes:
  t0: ch = EntryToken
        t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t5: v2f32,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
    t6: v2f64 = fp_extend t5
  t7: ch = WebAssemblyISD::RETURN t0, t6

...

Optimized type-legalized selection DAG: %bb.0 'load_promote_v2f62:'
SelectionDAG has 15 nodes:
  t0: ch = EntryToken
        t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
      t8: i64,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
    t9: v2i64 = scalar_to_vector t8
  t10: v4f32 = bitcast t9
        t12: f32 = extract_vector_elt t10, Constant:i32<0>
      t13: f64 = fp_extend t12
        t15: f32 = extract_vector_elt t10, Constant:i32<1>
      t16: f64 = fp_extend t15
    t17: v2f64 = BUILD_VECTOR t13, t16
  t7: ch = WebAssemblyISD::RETURN t0, t17

but v4f32 to v4f64 extending loads were lowered like this:

SelectionDAG has 14 nodes:
  t0: ch = EntryToken
    t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
  t6: ch = CopyToReg t0, Register:i32 %0, t2
  t7: i32 = Constant<0>
          t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
        t9: v4f32,ch = load<(load (s128) from %ir.p)> t6, t4, undef:i32
      t10: v4f64 = fp_extend t9
      t11: i32,ch = CopyFromReg t0, Register:i32 %0
    t12: ch = store<(store (s256))> t6, t10, t11, undef:i32
  t13: ch = WebAssemblyISD::RETURN t12

...

Optimized type-legalized selection DAG: %bb.0 'load_promote_v4f64:'
SelectionDAG has 19 nodes:
  t0: ch = EntryToken
  t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
    t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
  t6: ch = CopyToReg t0, Register:i32 %0, t2
  t11: i32,ch = CopyFromReg t0, Register:i32 %0
        t17: v2f64,ch = load<(load (s64) from %ir.p, align 16), anyext from v2f32> t6, t4, undef:i32
      t22: ch = store<(store (s128), align 32)> t6, t17, t11, undef:i32
          t19: i32 = add nuw t4, Constant:i32<8>
        t20: v2f64,ch = load<(load (s64) from %ir.p + 8, basealign 16), anyext from v2f32> t6, t19, undef:i32
        t24: i32 = add nuw t11, Constant:i32<16>
      t25: ch = store<(store (s128) into unknown-address + 16, basealign 32)> t6, t20, t24, undef:i32
    t26: ch = TokenFactor t22, t25
  t13: ch = WebAssemblyISD::RETURN t26

So the original loads get split up in both cases, but the reason is different. For the narrower vectors, v2f32 is not a legal type, so the load gets split by type legalization; for the wider vectors, it's the v4f64 result of the fp_extend that gets split, taking the load in the operand position along with it. I guess the different splitting code paths lead to completely different DAGs later.
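
For context, the second DAG above comes from IR roughly like the following (a sketch; the actual test is in simd-load-promote-wide.ll, and the narrow case is the analogous <2 x float> function sketched under the summary):

define <4 x double> @load_promote_v4f64(<4 x float>* %p) {
  ; The <4 x double> return is wider than a single v128 register, so it is
  ; returned indirectly through a hidden sret pointer, which is why the DAG
  ; above stores the result through ARGUMENT 0 instead of returning it.
  %v = load <4 x float>, <4 x float>* %p
  %e = fpext <4 x float> %v to <4 x double>
  ret <4 x double> %e
}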

This regresses the addressing modes supported for
the extloads not produced by legalization, but that's a fine trade-off for now.

Can you give an example of this case?

All of the new load_promote tests in simd-offset.ll have to explicitly calculate the load offset rather than just use the offset immediate on the load. Before this change they would use the offset immediate.
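
Concretely, a test in the style of simd-offset.ll looks roughly like this (names and output are approximate, not copied from the file):

define <2 x double> @load_promote_v2f64_with_folded_offset(<2 x float>* %p) {
  %q = ptrtoint <2 x float>* %p to i32
  %r = add nuw i32 %q, 16
  %s = inttoptr i32 %r to <2 x float>*
  %v = load <2 x float>, <2 x float>* %s
  %e = fpext <2 x float> %v to <2 x double>
  ret <2 x double> %e
}
; Before this change the constant 16 could be folded into the load's offset
; immediate; after it, the address is computed explicitly, roughly:
;   i32.const 16
;   i32.add
;   v128.load64_zero
;   f64x2.promote_low_f32x4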

llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td
1292

load sets IsNonExtLoad = true, which is not what we want here. Apparently an "unindexed" load is just a normal load; the other (indexed) kinds of loads also perform an address computation and return the computed address as an additional result of the same instruction.

1293

I believe so. We are inheriting from PatFrag, not from unindexedload directly, and I think the whole new fragment has to be marked as a load. Looking at the definitions of e.g. extloadf32, though, we could use extload instead of unindexedload and get rid of the IsAnyExtLoad = true line.

llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll
30

Yes, there is room to further optimize the integer version of this. Right now <4 x i32> to <4 x i64> extending loads are lowered to a pattern of one v128.load followed by some shuffles and vector shifts. That doesn't seem too bad, but it would be better to use the extend_low and extend_high instructions as you suggest instead.

Even here it might be better to do a v128.load followed by a promote_low, shuffle, and second promote_low rather than doing two v128.load64_zero. I'll check with the SIMD experts on that...
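
As a sketch of the integer case being discussed (the instruction sequences in the comments are approximate):

define <4 x i64> @load_sext_v4i64(<4 x i32>* %p) {
  %v = load <4 x i32>, <4 x i32>* %p
  %e = sext <4 x i32> %v to <4 x i64>
  ret <4 x i64> %e
}
; Current lowering: one v128.load followed by some shuffles and vector shifts.
; The alternative suggested above: one v128.load followed by
; i64x2.extend_low_i32x4_s and i64x2.extend_high_i32x4_s, one per output half.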

tlively updated this revision to Diff 368776. Aug 25 2021, 4:44 PM
  • Address feedback
tlively updated this revision to Diff 368777. Aug 25 2021, 4:47 PM
  • Missed feedback
tlively marked an inline comment as done. Aug 25 2021, 4:47 PM
aheejin accepted this revision. Aug 27 2021, 6:23 PM

Thanks for the explanation! Then after this CL, does the promote_low part of LowerConvertLow become redundant? Hmm, maybe not completely, because that path can also handle shuffles? (By the way, we don't seem to have a test for the shuffled promote_low case; simd-conversions.ll only contains integer vector to v2f64 tests. It would be good to have one for float vectors to v2f64 too.) But given that the LowerConvertLow function also takes care of integer to float conversions, we cannot remove that function anyway. (Unless we do a similar optimization for integer vectors.)

llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td
1293

That's nice. By the way extload also sets IsLoad to true, so this line will still be unnecessary.

This revision is now accepted and ready to land. Aug 27 2021, 6:23 PM

Thanks for the explanation! Then after this CL, does the promote_low part of LowerConvertLow become redundant? Hmm, maybe not completely, because that path can also handle shuffles? (By the way, we don't seem to have a test for the shuffled promote_low case; simd-conversions.ll only contains integer vector to v2f64 tests. It would be good to have one for float vectors to v2f64 too.) But given that the LowerConvertLow function also takes care of integer to float conversions, we cannot remove that function anyway. (Unless we do a similar optimization for integer vectors.)

That lowering is used in cases where there is no load, so I don't think it's redundant. We also already have tests for float to double promote_low and one test that uses the shuffle behavior: https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/WebAssembly/simd-conversions.ll#L431-L443
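
For reference, the linked test exercises a pattern along these lines (a sketch in the spirit of that test, not its exact IR):

define <2 x double> @promote_mixed_v2f64(<4 x float> %x, <4 x float> %y) {
  %v = shufflevector <4 x float> %x, <4 x float> %y, <2 x i32> <i32 2, i32 6>
  %e = fpext <2 x float> %v to <2 x double>
  ret <2 x double> %e
}
; Here the two source lanes are first shuffled into the low half of a vector
; and then extended with f64x2.promote_low_f32x4.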

tlively updated this revision to Diff 369977. Sep 1 2021, 10:23 AM
  • Remove unnecessary line
This revision was landed with ongoing or failed builds. Sep 1 2021, 10:27 AM
This revision was automatically updated to reflect the committed changes.