This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] narrow vector binops when extraction is cheap
ClosedPublic

Authored by spatel on Oct 26 2018, 4:26 PM.

Details

Summary

Narrowing vector binops came up in the demanded bits discussion in D52912.

I don't think we're going to be able to do this transform in IR as a canonicalization because of the risk of creating unsupported widths for vector ops, but we already have a DAG TLI hook to allow what I was hoping for: isExtractSubvectorCheap(). This is currently enabled for x86, ARM, and AArch64 (although only x86 has existing regression test diffs).

This is artificially limited to not look through bitcasts because there are so many test diffs already, but that's marked with a TODO and is a small follow-up.

Diff Detail

Event Timeline

spatel created this revision.Oct 26 2018, 4:26 PM
RKSimon added inline comments.Oct 27 2018, 12:02 AM
test/CodeGen/X86/avx2-schedule.ll
238

This needs looking at - its supposed to test the schedule of the vextracti128 rr/mr ops - the add/sub are purely there to keep the correct domain. If all else fails you can convert it to inline asm.

spatel added inline comments.Oct 27 2018, 9:50 AM
test/CodeGen/X86/avx2-schedule.ll
238

Updated here:
rL345455

spatel updated this revision to Diff 171403.Oct 27 2018, 9:57 AM
spatel edited the summary of this revision. (Show Details)

Patch updated:
No code changes; made the avx2-sched.ll test file independent of this patch.

RKSimon added inline comments.Oct 27 2018, 12:21 PM
test/CodeGen/X86/vector-reduce-xor.ll
236

There's still a lot of logic ops happening at full width - is this to do with the bitcasts?

spatel added inline comments.Oct 28 2018, 7:59 AM
test/CodeGen/X86/vector-reduce-xor.ll
236

Yes:

t5: v8i32 = xor t2, t30
              t24: v4i64 = bitcast t5
            t34: v2i64 = extract_subvector t24, Constant:i64<0>

...so we should pick that up in the follow-up, or if we want to go big, I can add those diffs to this patch.

spatel added inline comments.Oct 29 2018, 7:04 AM
test/CodeGen/X86/vector-reduce-xor.ll
236

Looking a bit closer here, the bitcast enhancement is probably not enough. If there are multiple uses of a binop, we have to be careful, or we'll break a single wide op into multiple narrow ops.

There are still several patches to go before we get optimal codegen for reductions, so I'd rather not risk it by adding more patterns to this basic patch.

RKSimon accepted this revision.Oct 30 2018, 4:41 AM

LGTM as a first step

This revision is now accepted and ready to land.Oct 30 2018, 4:41 AM
This revision was automatically updated to reflect the committed changes.

This change appears to have injected a regression into Halide -- as of this change, we are no longer correctly generating psubusb/psubusw instructions in Halide-generated code.

@steven-johnson do you have an IR test case? There might be some X86 specific change I can make.