This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] look through bitcasts when trying to narrow vector binops
ClosedPublic

Authored by spatel on Nov 11 2018, 8:29 AM.

Details

Summary

This is another step in vector narrowing - a follow-up to D53784 (and hoping to eventually squash potential regressions seen in D51553).

Most of the x86 test diffs are wins, but the AArch64 diff is probably not. Do we need the target hook to distinguish between "isExtractSubvectorCheap" and "isExtractSubvectorFree"?

This problem probably already exists independent of this patch, but it went unnoticed in the previous patch because there were no regression tests that showed that possibility.

The x86 diff in i64-mem-copy.ll is also close. Given the frequency throttling concerns with using wider vector ops, I think an extra extract to reduce vector width is a reasonable trade-off at this level of codegen, but if we're going to refine the target hook for AArch, we might adjust the x86 override too.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Nov 11 2018, 8:29 AM
spatel added inline comments.Nov 11 2018, 8:40 AM
test/CodeGen/AArch64/arm64-ld1.ll
918–919 ↗(On Diff #173558)

Side note for the ARM folks - I think this applies here?

UXTL{2} <Vd>.<Ta>, <Vn>.<Tb>
is equivalent to
USHLL{2} <Vd>.<Ta>, <Vn>.<Tb>, #0
and is the preferred disassembly...

The X86 code looks reasonable to me - not sure how far the rabbit hole we want to go with 'free vs cheap' - especially as 'cheap' can be so relative, it'll quickly end up needing a cost model......

RKSimon added a subscriber: t.p.northover.

@t.p.northover The aarch64 regression looks like a failure in performAddSubLongCombine?

Ping (AArch64 expertise requested).

It's probably okay to canonicalize the way you are, but you're hitting a missing pattern for AArch64. Something like the following appears to work:

def : Pat<(sub (extract_subvector (zext v8i8:$LHS), (i64 0)),
               (extract_subvector (zext v8i8:$RHS), (i64 0))),
          (EXTRACT_SUBREG (USUBLv8i8_v8i16 v8i8:$LHS, v8i8:$RHS), dsub)>;

Of course, needs to be rewritten to to match all the relevant types and operations. x86 doesn't really have those sort of operations, I guess?

(performAddSubLongCombine probably also should be extended, but that's not what you're seeing.)

test/CodeGen/AArch64/arm64-ld1.ll
918–919 ↗(On Diff #173558)

Not sure why the alias isn't getting automatically applied; please file a bug.

test/CodeGen/X86/i64-mem-copy.ll
95 ↗(On Diff #173558)

This appears to be one instruction more... but maybe worth avoid 256-bit operations on x86?

spatel marked 2 inline comments as done.Nov 20 2018, 8:38 AM

It's probably okay to canonicalize the way you are, but you're hitting a missing pattern for AArch64. Something like the following appears to work:

def : Pat<(sub (extract_subvector (zext v8i8:$LHS), (i64 0)),
               (extract_subvector (zext v8i8:$RHS), (i64 0))),
          (EXTRACT_SUBREG (USUBLv8i8_v8i16 v8i8:$LHS, v8i8:$RHS), dsub)>;

Of course, needs to be rewritten to to match all the relevant types and operations. x86 doesn't really have those sort of operations, I guess?

Filed here:
https://bugs.llvm.org/show_bug.cgi?id=39722

Given that it's an existing bug, there's probably not much incentive to make this patch dependent on that getting fixed?

test/CodeGen/AArch64/arm64-ld1.ll
918–919 ↗(On Diff #173558)
test/CodeGen/X86/i64-mem-copy.ll
95 ↗(On Diff #173558)

Right - this is the test I mentioned in the initial summary. Given the current HW implementation choices (frequency throttling based on count of vector ops), I think this is the preferred form despite the extra instruction.

efriedma accepted this revision.Nov 20 2018, 11:28 AM

Yes, we don't need to block this fix, I think; LGTM

This revision is now accepted and ready to land.Nov 20 2018, 11:28 AM
This revision was automatically updated to reflect the committed changes.
spatel marked 2 inline comments as done.