- Tablegen patterns exist to select 'xtn' and 'uzp1' for trunc [1]. Cost-table entries are updated based on the actual number of {xtn, uzp1} instructions generated.
- Without this, an IR instruction like `trunc <8 x i16> %v to <8 x i8>` is considered free and may be sunk into other basic blocks. As a result, the sunk 'trunc' ends up in a different basic block from its (usually not-free) vector operand and misses the chance to be combined during instruction selection. (examples in [2])
- It would take a lot of effort to teach CodeGenPrepare.cpp to sink the operand of trunc without introducing regressions, since the instruction computing the trunc operand could be faster (e.g., in throughput) than the instruction selected for "trunc (bin-vector-op)".
- For instance, in [3], sinking %1 (the trunc operand) into bb.1 and bb.2 means replacing 2 xtn with 2 shrn (shrn has a throughput of 1 and only utilizes the V1 pipeline), which is not necessarily a win, especially since the ushr result still needs to be preserved for the store in bb.0. Meanwhile, it is too optimistic (for the CodeGenPrepare pass) to assume that machine-cse will always be able to de-duplicate the shrn instructions from the various basic blocks into one shrn.
[1] Patterns for {v8i16->v8i8, v4i32->v4i16, v2i64->v2i32}, and a pattern for concat(trunc, trunc) -> uzp1
[2] A pattern for trunc(umin(X, 255)) -> UQXTN v8i8 (and other {u,s}{min,max} patterns for v8i16 operands), and patterns for shrn (v8i16->v8i8, v2i64->v2i32)
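The trunc patterns referenced in [1] look roughly like the following TableGen sketch; this is illustrative of the shape of such patterns, not a verbatim copy of AArch64InstrInfo.td:

```tablegen
// Sketch: select xtn for a simple narrowing vector trunc.
def : Pat<(v8i8  (trunc (v8i16 V128:$Vn))), (XTNv8i8  V128:$Vn)>;
def : Pat<(v4i16 (trunc (v4i32 V128:$Vn))), (XTNv4i16 V128:$Vn)>;
def : Pat<(v2i32 (trunc (v2i64 V128:$Vn))), (XTNv2i32 V128:$Vn)>;

// Sketch: a concat of two truncs selects a single uzp1 on the wide inputs,
// taking the even-indexed bytes of both source registers in one instruction.
def : Pat<(v16i8 (concat_vectors (v8i8 (trunc (v8i16 V128:$Vn))),
                                 (v8i8 (trunc (v8i16 V128:$Vm))))),
          (UZP1v16i8 V128:$Vn, V128:$Vm)>;
```

Because each IR trunc maps to exactly one xtn (or one uzp1 for the concat form), the cost-table entries in this change record a cost of that many instructions rather than zero.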
[3]
```llvm
; instruction latency / throughput / pipeline on `neoverse-n1`
bb.0:
  %1 = lshr <8 x i16> %10, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4> ; ushr, latency 2, throughput 1, pipeline V1
  %2 = trunc <8 x i16> %1 to <8 x i8>                                               ; xtn, latency 2, throughput 2, pipeline V
  store <8 x i16> %1, ptr %addr
  br i1 %cond, label %bb.1, label %bb.2

bb.1:
  %4 = trunc <8 x i16> %1 to <8 x i8>                                               ; xtn

bb.2:
  %5 = trunc <8 x i16> %1 to <8 x i8>                                               ; xtn
```
It looks like the throughput is already 2 here: https://godbolt.org/z/T4rTqf1Tx. I guess it is not very reliable at the moment.