This is an alternative to D58210 that achieves similar results for SystemZ. I'm proposing to extend the general (trunc (binop X, Y)) transform by consulting the existing TLI hook isTypeDesirableForOp() when we are post-legalization. This allows us to eliminate the similar fold in distributeTruncateThroughAnd(), which is oddly constrained because it starts the pattern match only from a shift/rotate.
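For reference, the reason narrowing a binop through a truncate is legal at all can be sketched in plain C++. This is not DAGCombiner code: the helper names (trunc8, narrowingPreservesResult, lshrNarrowingPreservesResult, checkSamples) are hypothetical, the 32-bit to 8-bit widths are arbitrary, and the list of opcodes is an assumption chosen for illustration. It only shows that for ops whose low result bits depend solely on the low operand bits, truncating before or after the op gives the same value, while a logical shift right is a counterexample.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a truncate from i32 to i8 (hypothetical helper).
uint8_t trunc8(uint32_t v) { return static_cast<uint8_t>(v); }

// True if trunc(binop(x, y)) == binop(trunc(x), trunc(y)) for each listed op.
// The low 8 bits of these results depend only on the low 8 bits of the inputs.
bool narrowingPreservesResult(uint32_t x, uint32_t y) {
  uint8_t nx = trunc8(x), ny = trunc8(y);
  return trunc8(x + y) == static_cast<uint8_t>(nx + ny) &&
         trunc8(x - y) == static_cast<uint8_t>(nx - ny) &&
         trunc8(x * y) == static_cast<uint8_t>(nx * ny) &&
         trunc8(x & y) == static_cast<uint8_t>(nx & ny) &&
         trunc8(x | y) == static_cast<uint8_t>(nx | ny) &&
         trunc8(x ^ y) == static_cast<uint8_t>(nx ^ ny);
}

// Logical shift right pulls high bits into the low part, so the same
// rewrite is not value-preserving in general. amt must be less than 8 here.
bool lshrNarrowingPreservesResult(uint32_t x, unsigned amt) {
  return trunc8(x >> amt) == static_cast<uint8_t>(trunc8(x) >> amt);
}

// Sweep a sample of operand pairs, then check one shift counterexample.
void checkSamples() {
  for (uint32_t x = 0; x < 0x20000; x += 0x101)
    for (uint32_t y = 0; y < 0x20000; y += 0x1F3)
      assert(narrowingPreservesResult(x, y));
  assert(!lshrNarrowingPreservesResult(0x1234u, 4));
}
```

Legality is therefore not the interesting question for this patch; the question is profitability, which is what the isTypeDesirableForOp() query is meant to answer for the narrow type.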
About the test diffs:
- AMDGPU: 'widen-smrd-loads' is an improvement; the 'idot' diffs look like regressions?
- PowerPC: neutral
- SystemZ: improvements or neutral
- x86: mostly neutral, improvements with 'shld/shrd', and regressions for 'vector-sext'.
I did look at the x86 'vector-sext' regressions, and the seemingly unnecessary 'movzbl' are being inserted by an isel pattern because:
// anyext. Define these to do an explicit zero-extend to
// avoid partial-register updates.
So that conflicts with the x86 setting that says 8-bit ops are desirable. Ideally, we would defer partial-reg optimizations to a later pass (and I know we already do this for some cases, so maybe that just needs to be adjusted a bit).
Due to the lack of register bank awareness, the AMDGPU answer depends on whether the source node is divergent. To answer properly, the hook needs to be passed the node itself, not just the opcode and type.