In the attached Hexagon test, the vector type <32 x i64> gets scalarized into a number of extracts and build_pairs. It is subsequently not optimized further, and instead of a single HVX instruction, we end up with a lot of scalar code.
Without this patch, before type legalization we have:
Optimized lowered selection DAG: %bb.0 'f0:b0' SelectionDAG has 15 nodes: t0: ch = EntryToken t2: v32i32,ch = CopyFromReg t0, Register:v32i32 %0 t19: v32i32 = splat_vector Constant:i32<1> t18: v32i32 = shl t2, t19 t7: v64i32 = concat_vectors t18, undef:v32i32 t9: v64i32 = vector_shuffle<0,u,1,u,2,u,3,u,4,u,5,u,6,u,7,u,8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u,16,u,17,u,18,u,19,u,20,u,21,u,22,u,23,u,24,u,25,u,26,u,27,u,28,u,29,u,30,u,31,u> t7, undef:v64i32 t10: v32i64 = bitcast t9 t11: v32i32 = truncate t10 t13: ch,glue = CopyToReg t0, Register:v32i32 $v0, t11 t14: ch = HexagonISD::RET_FLAG t13, Register:v32i32 $v0, t13:1
After
Type-legalized selection DAG: %bb.0 'f0:b0' SelectionDAG has 205 nodes: t0: ch = EntryToken t2: v32i32,ch = CopyFromReg t0, Register:v32i32 %0 t19: v32i32 = splat_vector Constant:i32<1> t18: v32i32 = shl t2, t19 t7: v64i32 = concat_vectors t18, undef:v32i32 t9: v64i32 = vector_shuffle<0,u,1,u,2,u,3,u,4,u,5,u,6,u,7,u,8,u,9,u,10,u,11,u,12,u,13,u,14,u,15,u,16,u,17,u,18,u,19,u,20,u,21,u,22,u,23,u,24,u,25,u,26,u,27,u,28,u,29,u,30,u,31,u> t7, undef:v64i32 t31: i32 = extract_vector_elt t9, Constant:i32<0> t32: i32 = extract_vector_elt t9, Constant:i32<1> t157: i64 = build_pair t31, t32 t227: i32 = truncate t157 t34: i32 = extract_vector_elt t9, Constant:i32<2> t36: i32 = extract_vector_elt t9, Constant:i32<3> t158: i64 = build_pair t34, t36 t229: i32 = truncate t158 t38: i32 = extract_vector_elt t9, Constant:i32<4> t40: i32 = extract_vector_elt t9, Constant:i32<5> t159: i64 = build_pair t38, t40 t231: i32 = truncate t159 t42: i32 = extract_vector_elt t9, Constant:i32<6> t44: i32 = extract_vector_elt t9, Constant:i32<7> t160: i64 = build_pair t42, t44 t233: i32 = truncate t160 [...]
This should be a harmless change, but there is one degradation on x86: llvm/test/CodeGen/X86/test-shrink.ll
The DAG before instruction selection:
SelectionDAG has 14 nodes: t0: ch = EntryToken t41: i32,ch = load<(load (s16) from %fixed-stack.1, align 4), zext from i16> t0, FrameIndex:i32<-1>, undef:i32 t40: i32 = and t41, Constant:i32<32768> t38: i16 = truncate t40 t32: i32 = X86ISD::CMP t38, Constant:i16<0> t34: ch = X86ISD::BRCOND t0, BasicBlock:ch<no 0x82ec648>, TargetConstant:i8<8>, t32 t18: ch = br t34, BasicBlock:ch<yes 0x82ec550>
The i32 = load (s16) does not match the pattern that generates AND32rm, and we generate two instructions instead of one.
Capitalize