This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Convert insert_vector_elt into set of selects
ClosedPublic

Authored by rampitec on Nov 15 2018, 3:45 PM.

Details

Summary

This allows us to avoid scratch use or indirect VGPR addressing for small vectors.
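
The idea can be pictured roughly as below. This is only an illustrative sketch (the helper name and details are made up here, not the actual code in SIISelLowering.cpp): a dynamically indexed insert is expanded into one compare-and-select per lane feeding a BUILD_VECTOR, so neither a scratch buffer nor movrel-style indirect addressing is needed. This matches the shape of the select_cc/BUILD_VECTOR DAG quoted later in the thread.

  // Illustrative sketch only; the real lowering in SIISelLowering.cpp differs.
  // Expand insert_vector_elt with a non-constant index into per-lane selects:
  //   lane I = (Idx == I) ? Val : Vec[I]
  static SDValue lowerInsertEltAsSelects(SDValue Vec, SDValue Val, SDValue Idx,
                                         const SDLoc &SL, SelectionDAG &DAG) {
    EVT VecVT = Vec.getValueType();
    EVT EltVT = VecVT.getVectorElementType();
    unsigned NumElts = VecVT.getVectorNumElements();

    SmallVector<SDValue, 16> Lanes;
    for (unsigned I = 0; I != NumElts; ++I) {
      SDValue OldElt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT, Vec,
                                   DAG.getConstant(I, SL, MVT::i32));
      SDValue LaneIdx = DAG.getConstant(I, SL, Idx.getValueType());
      // Lane I takes the new value only when the runtime index equals I.
      Lanes.push_back(DAG.getSelectCC(SL, Idx, LaneIdx, Val, OldElt, ISD::SETEQ));
    }
    return DAG.getBuildVector(VecVT, SL, Lanes);
  }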

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision. Nov 15 2018, 3:45 PM

Mostly looks good to me.

However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

lib/Target/AMDGPU/SIISelLowering.cpp
8118–8119 (On Diff #174295)

clang-format?

rampitec updated this revision to Diff 174385. Nov 16 2018, 9:31 AM
rampitec marked an inline comment as done.

Run clang-format.

> Mostly looks good to me.

> However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

Yes, I noticed that too. That needs to be a separate optimization. As far as I understand, "insert_vector_elt undef, %var, %idx" should not even reach this point; it should be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, i.e. earlier (higher in the same function, I think).
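
For illustration only, a hypothetical sketch of the replacement described above, written as a small DAG fold (this is not code from the patch; the follow-up comments below discuss where such a fold should actually live):

  // Hypothetical sketch of "insert_vector_elt undef, %var, %idx -> splat".
  // All lanes other than %idx are undef, so splatting %var is a legal choice.
  static SDValue combineInsertIntoUndef(SDNode *N, SelectionDAG &DAG) {
    assert(N->getOpcode() == ISD::INSERT_VECTOR_ELT && "expected an insert");
    SDValue Vec = N->getOperand(0);
    SDValue Val = N->getOperand(1);
    if (!Vec.isUndef())
      return SDValue(); // Only the undef-vector case is handled here.

    EVT VT = N->getValueType(0);
    SDLoc SL(N);
    SmallVector<SDValue, 16> Lanes(VT.getVectorNumElements(), Val);
    return DAG.getBuildVector(VT, SL, Lanes);
  }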

>> However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.

> Yes, I noticed that too. That needs to be a separate optimization. As far as I understand, "insert_vector_elt undef, %var, %idx" should not even reach this point; it should be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, i.e. earlier (higher in the same function, I think).

I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_elt undef, %var, %idx implies only one store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

>> Yes, I noticed that too. That needs to be a separate optimization. As far as I understand, "insert_vector_elt undef, %var, %idx" should not even reach this point; it should be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, i.e. earlier (higher in the same function, I think).

> I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_elt undef, %var, %idx implies only one store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

Does this actually happen? I thought we custom lowered all of the vector operations.

Another change I've wanted to look at is changing what AMDGPUPromoteAlloca tries to produce. The dynamic vector indexing is going to be worse if the waterfall loop is going to be required, but it's currently preferred if both are possible.

> I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_elt undef, %var, %idx implies only one store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

Hm... Ok, I will add it here.

> Another change I've wanted to look at is changing what AMDGPUPromoteAlloca tries to produce. The dynamic vector indexing is going to be worse if the waterfall loop is going to be required, but it's currently preferred if both are possible.

Essentially it deals with the alloca to begin with. Scratch is always worse than movrel (even with the waterfall loop), and movrel is worse than a small set of selects. So it is quite natural that we can get either a set of selects or a movrel from an alloca. Unless we end up spilling later, but it is too early to estimate register pressure here.

Anyway, any ideas on what it could produce that would be better?

>> I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_elt undef, %var, %idx implies only one store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.

> Hm... Ok, I will add it here.

On second thought, I still believe it should be done elsewhere, in the common DAG combiner. This is the DAG this code produces:

  t28: f32 = select_cc t19, Constant:i32<0>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
  t30: f32 = select_cc t19, Constant:i32<1>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
  t32: f32 = select_cc t19, Constant:i32<2>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
  t34: f32 = select_cc t19, Constant:i32<3>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
t35: v4f32 = BUILD_VECTOR t28, t30, t32, t34

and then:

    t40: i1 = setcc t19, Constant:i32<0>, seteq:ch
  t41: f32 = select t40, ConstantFP:f32<1.000000e+00>, undef:f32
    t42: i1 = setcc t19, Constant:i32<1>, seteq:ch
  t43: f32 = select t42, ConstantFP:f32<1.000000e+00>, undef:f32
    t44: i1 = setcc t19, Constant:i32<2>, seteq:ch
  t45: f32 = select t44, ConstantFP:f32<1.000000e+00>, undef:f32
    t46: i1 = setcc t19, Constant:i32<3>, seteq:ch
  t47: f32 = select t46, ConstantFP:f32<1.000000e+00>, undef:f32
t35: v4f32 = BUILD_VECTOR t41, t43, t45, t47

So it looks like an appropriate optimization is:

select cc, x, undef => x
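
A minimal sketch of what that fold could look like, assuming a DAG-combine-style helper (illustrative only; the combine that actually landed may be structured differently):

  // Sketch only: fold a select whose unchosen side is undef to the defined
  // value. If the false operand is undef, returning the true operand is legal
  // for both outcomes of the condition, since undef may take any value.
  static SDValue foldSelectOfUndef(SDNode *N) {
    assert(N->getOpcode() == ISD::SELECT && "expected a select node");
    SDValue TVal = N->getOperand(1);
    SDValue FVal = N->getOperand(2);
    if (FVal.isUndef()) // select cc, x, undef -> x
      return TVal;
    if (TVal.isUndef()) // select cc, undef, x -> x
      return FVal;
    return SDValue();   // No fold applied.
  }
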
rampitec updated this revision to Diff 174448. Nov 16 2018, 2:23 PM

Converted tests to non-undef vectors and added a test with an undef vector.
This is now based on D54646, and insert_vector_elt undef, ... now results in just a splat.

nhaehnle accepted this revision. Nov 19 2018, 3:54 AM

Yeah, the separate DAG combine for scalar select w/ undef is a better solution. LGTM.

This revision is now accepted and ready to land. Nov 19 2018, 3:54 AM
This revision was automatically updated to reflect the committed changes.