This allows us to avoid scratch use or indirect VGPR addressing for small vectors.
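For illustration, here is a minimal sketch of the technique, assuming a free-standing helper with made-up names and a simplified i1 setcc result type (this is not the patch's code): each lane is compared against the dynamic index and selected accordingly, and the results feed a build_vector.

```cpp
// Hypothetical sketch (not the patch itself): expand a dynamic
// insert_vector_elt into a compare+select per lane, so no scratch
// store/load or indirect (movrel) addressing is needed.
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

static SDValue expandInsertEltAsSelects(SelectionDAG &DAG, const SDLoc &DL,
                                        SDValue Vec, SDValue Val, SDValue Idx) {
  EVT VecVT = Vec.getValueType();
  EVT EltVT = VecVT.getVectorElementType();
  EVT IdxVT = Idx.getValueType();

  SmallVector<SDValue, 8> Elts;
  for (unsigned I = 0, E = VecVT.getVectorNumElements(); I != E; ++I) {
    // Lane I keeps its old value unless the dynamic index selects it.
    SDValue OldElt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT, Vec,
                                 DAG.getConstant(I, DL, IdxVT));
    SDValue IsLane = DAG.getSetCC(DL, MVT::i1, Idx,
                                  DAG.getConstant(I, DL, IdxVT), ISD::SETEQ);
    Elts.push_back(DAG.getSelect(DL, EltVT, IsLane, Val, OldElt));
  }
  return DAG.getBuildVector(VecVT, DL, Elts);
}
```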
Repository: rL LLVM
Event Timeline
Mostly looks good to me.
However, why does code with undef vectors look so bad? For example, in float4_inselt, the fact that the initial vector is undef should allow us to just store a splat of 1.0.
| File | Lines | Comment |
|---|---|---|
| lib/Target/AMDGPU/SIISelLowering.cpp | 8118–8119 (On Diff #174295) | clang-format? |
Yes, I noticed that too. That needs to be a separate optimization. As far as I understand, "insert_vector_element undef, %var, %idx" should not even reach this point. It should be replaced by build_vector (n x %var) regardless of the thresholds and heuristics I am using, i.e. earlier (higher in the same function, I think).
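For illustration, a hedged sketch of that suggested fold (the helper name is hypothetical; this is not code from the patch): inserting into an undef vector simply splats the inserted value, since the remaining undef lanes may take any value.

```cpp
// Hypothetical sketch of the suggested fold, not code from this patch:
// insert_vector_elt undef, %var, %idx  ->  build_vector (n x %var)
#include <cassert>
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

static SDValue foldInsertIntoUndef(SelectionDAG &DAG, const SDLoc &DL,
                                   SDValue UndefVec, SDValue Val) {
  assert(UndefVec.isUndef() && "expected an undef vector operand");
  // A splat of the inserted value, i.e. build_vector (n x %var).
  return DAG.getSplatBuildVector(UndefVec.getValueType(), DL, Val);
}
```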
I disagree. When we end up using scratch memory for a vector, build_vector (n x %var) would imply n stores, while insert_vector_element undef, %var, %idx implies only 1 store, so doing the transform seems like a pretty terrible idea in general. I think exploiting undef is really specific to what you're doing here and should therefore be done here as well.
Another change I've wanted to look at is changing what AMDGPUPromoteAlloca tries to produce. The dynamic vector indexing is going to be worse if the waterfall loop is going to be required, but it's currently preferred if both are possible.
Essentially, it deals with the alloca to begin with. Scratch is always worse than movrel (even with the waterfall loop), and movrel is worse than a small set of selects. So it is quite natural that an alloca can become either a set of selects or a movrel. Unless we end up spilling later, but it is too early to estimate register pressure here.
Anyway, any ideas on what it could produce that would be better?
On second thought, I still believe it should be done elsewhere, in the common DAG combiner. This is the DAG this code produces:
    t28: f32 = select_cc t19, Constant:i32<0>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
    t30: f32 = select_cc t19, Constant:i32<1>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
    t32: f32 = select_cc t19, Constant:i32<2>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
    t34: f32 = select_cc t19, Constant:i32<3>, ConstantFP:f32<1.000000e+00>, undef:f32, seteq:ch
    t35: v4f32 = BUILD_VECTOR t28, t30, t32, t34
and then:
    t40: i1 = setcc t19, Constant:i32<0>, seteq:ch
    t41: f32 = select t40, ConstantFP:f32<1.000000e+00>, undef:f32
    t42: i1 = setcc t19, Constant:i32<1>, seteq:ch
    t43: f32 = select t42, ConstantFP:f32<1.000000e+00>, undef:f32
    t44: i1 = setcc t19, Constant:i32<2>, seteq:ch
    t45: f32 = select t44, ConstantFP:f32<1.000000e+00>, undef:f32
    t46: i1 = setcc t19, Constant:i32<3>, seteq:ch
    t47: f32 = select t46, ConstantFP:f32<1.000000e+00>, undef:f32
    t35: v4f32 = BUILD_VECTOR t41, t43, t45, t47
So it looks like an appropriate optimization is:
select cc, x, undef => x
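A minimal sketch of that scalar combine in DAGCombiner style, assuming a standalone helper (names and placement are illustrative, not the committed code):

```cpp
// Sketch: fold a select whose untaken arm is undef. The undef operand
// may take any value, so the select can always become the defined one.
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

static SDValue foldSelectWithUndefArm(SDNode *N) {
  SDValue TrueV = N->getOperand(1);
  SDValue FalseV = N->getOperand(2);
  if (FalseV.isUndef())
    return TrueV;    // select cc, x, undef -> x
  if (TrueV.isUndef())
    return FalseV;   // select cc, undef, x -> x
  return SDValue();  // no fold
}
```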
Converted tests to non-undef vectors, added a test with an undef vector.
This is now based on D54646, and insert_vector_elt undef, ... now results in just a splat.
Yeah, the separate DAG combine for scalar select w/ undef is a better solution. LGTM.