This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] loosen restriction for creating narrow vector load from extract(wide load)
AbandonedPublic

Authored by spatel on Jun 3 2017, 9:10 AM.

Details

Summary

This is a follow-up to the change in D33578 that introduced this transform:
(extract_subvector (load wide vector)) --> (load narrow vector)
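In DAG terms, the combine might look roughly like this (an illustrative sketch, not the in-tree code: it assumes a plain, non-volatile, unindexed load with a constant extract index, and the helper name is made up):

static SDValue narrowExtractedVectorLoad(SDNode *Extract, SelectionDAG &DAG) {
  // Match (extract_subvector (load wide-vector), constant-index).
  auto *Ld = dyn_cast<LoadSDNode>(Extract->getOperand(0));
  auto *ExtIdx = dyn_cast<ConstantSDNode>(Extract->getOperand(1));
  if (!Ld || !ExtIdx || !ISD::isNormalLoad(Ld) || Ld->isVolatile())
    return SDValue();

  // Byte offset of the extracted subvector within the wide load.
  EVT VT = Extract->getValueType(0);
  unsigned Offset = ExtIdx->getZExtValue() * VT.getScalarSizeInBits() / 8;

  // Load just the narrow vector from the offset address.
  SDLoc DL(Extract);
  SDValue Base = Ld->getBasePtr();
  EVT PtrVT = Base.getValueType();
  SDValue NewAddr = DAG.getNode(ISD::ADD, DL, PtrVT, Base,
                                DAG.getConstant(Offset, DL, PtrVT));
  return DAG.getLoad(VT, DL, Ld->getChain(), NewAddr,
                     Ld->getPointerInfo().getWithOffset(Offset));
}

(A real version also has to update users of the original load's chain and check legality/profitability before rewriting anything.)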

Diff Detail

Event Timeline

spatel created this revision. Jun 3 2017, 9:10 AM
niravd added inline comments. Jun 3 2017, 8:53 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14674

This condition looks like it will trigger differently for different indexes when combining consecutive subvectors extracted from the same larger vector. This is why the vec_int_to_fp case still has a load to zmm0 (see the other comment).

It seems like what we'd like to do is check that all uses of Ld are cheap if there is more than one use (and then convert all uses simultaneously), as sketched below. That said, I think checking freeness/cheapness for each possible ExtIdxValue is the way to go.
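Something like this hypothetical helper could express the all-uses check (illustrative only; the helper name is made up, and it uses the two-argument isExtractSubvectorCheap() hook as quoted elsewhere in this review):

static bool allExtractUsesAreCheap(LoadSDNode *Ld, const TargetLowering &TLI) {
  for (SDNode::use_iterator UI = Ld->use_begin(), UE = Ld->use_end();
       UI != UE; ++UI) {
    // Ignore users of the load's chain result; only value uses matter here.
    if (UI.getUse().getResNo() != 0)
      continue;
    SDNode *User = *UI;
    if (User->getOpcode() != ISD::EXTRACT_SUBVECTOR)
      return false;
    // Every extract must use a constant index that the target calls cheap.
    auto *Idx = dyn_cast<ConstantSDNode>(User->getOperand(1));
    if (!Idx || !TLI.isExtractSubvectorCheap(User->getValueType(0),
                                             Idx->getZExtValue()))
      return false;
  }
  return true;
}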

test/CodeGen/X86/vec_int_to_fp.ll
3579–3580

We're only partially converting the load-extracts here. There should be either a single load to zmmX followed by extracts, or 4 direct loads to xmmX.

spatel added inline comments. Jun 4 2017, 7:34 AM
test/CodeGen/X86/vec_int_to_fp.ll
3579–3580

Agreed - that's what I meant in the description when I said that these diffs might be seen as bugs in isExtractSubvectorCheap().

In this case, x86 has made it cheap to extract from index 0 or one other index:

return (Index == 0 || Index == ResVT.getVectorNumElements());

Clearly, this was only tested with cases where we are extracting a half-sized vector, so it misses 2 of the 4 possible positions when extracting a quarter-sized vector for AVX512 in this test.
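A more general check would accept any in-bounds index that is a multiple of the result's element count; as an illustrative guess (not necessarily the right in-tree fix), that would be:

return (Index % ResVT.getVectorNumElements()) == 0;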

I think this change is still an improvement (though not ideal, of course), but my goal with this patch was really to answer the questions for the non-x86 diffs. I could just skip this step and post the more liberal patch with more test diffs if that seems better.

tstellar edited edge metadata. Jun 5 2017, 2:24 AM

All of the AMDGPU test changes are regressions.

niravd edited edge metadata. Jun 5 2017, 6:43 AM

It looks like most of the AMDGPU cases fail because:

  • TLI.isExtractSubvectorCheap(VT, ExtIdxValue) is not defined for AMDGPU.
  • Legalization breaks sign-/zero-extended vectors into a concat of smaller subvectors.

The former seems easy for someone who knows AMDGPU to correct.

spatel added a comment. Jun 5 2017, 7:05 AM

> It looks like most of the AMDGPU cases fail because:
>
>   • TLI.isExtractSubvectorCheap(VT, ExtIdxValue) is not defined for AMDGPU.
>   • Legalization breaks sign-/zero-extended vectors into a concat of smaller subvectors.
>
> The former seems easy for someone who knows AMDGPU to correct.

Actually, I see another way out. I missed this TLI hook:

// Return true if it is profitable to reduce the given load node to a smaller
// type.
//
// e.g. (i16 (trunc (i32 (load x)))) -> (i16 (load x)) should be performed
virtual bool shouldReduceLoadWidth(SDNode *Load,
                                   ISD::LoadExtType ExtTy,
                                   EVT NewVT) const {
  return true;
}

This was originally added for AMDGPU (rL224084), so that should prevent the regressions.
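For illustration, a target that never wants its wide vector loads split might override the hook like this (a hypothetical example with a made-up class name; this is not AMDGPU's actual implementation):

bool MyTargetLowering::shouldReduceLoadWidth(SDNode *Load,
                                             ISD::LoadExtType ExtTy,
                                             EVT NewVT) const {
  // Hypothetical policy: refuse to narrow a wide vector load into
  // multiple narrower loads.
  EVT OldVT = Load->getValueType(0);
  if (OldVT.isVector() && NewVT.isVector() &&
      NewVT.getSizeInBits() < OldVT.getSizeInBits())
    return false;
  return true;
}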

spatel updated this revision to Diff 101412. Jun 5 2017, 9:06 AM
spatel edited the summary of this revision.
spatel added a reviewer: t.p.northover.

Patch updated:

  1. Remove the one-use restriction.
  2. Add the TLI.shouldReduceLoadWidth() predicate (see the sketch below).
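For item 2, the predicate presumably gates the combine along these lines (a sketch; the exact placement and variable names are assumptions):

  // Give the target a chance to veto narrowing the load.
  if (!TLI.shouldReduceLoadWidth(Ld, ISD::NON_EXTLOAD, VT))
    return SDValue();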

So now we see the full effect on x86 and sidestep the AMDGPU problems, but we seem to have introduced some ARM regressions.

AFAICT, the x86 diffs are all wins. This includes an improvement to select non-temporal loads where we failed to do so before.

spatel updated this revision to Diff 101415. Jun 5 2017, 9:18 AM

Patch updated:
Rebased after rL304718 - the AVX1 non-temporal isel got fixed there, so now we just see different scheduling in those tests.

efriedma edited edge metadata. Jun 5 2017, 11:52 AM

The diffs to the ARM tests are clearly no good: you're splitting 128-bit vector loads into two 64-bit vector loads for no benefit.

You're generating fewer instructions on x86, but it's not obvious that it's beneficial; you get rid of the EXTRACT_SUBVECTOR operations, but the end result is a lot more instructions with memory operands.

test/CodeGen/AArch64/arm64-vabs.ll
141

We need to generate more complete checks for these tests... but I would guess this is adding extra instructions.

arsenm resigned from this revision. Feb 21 2019, 6:57 PM

is this still relevant? abandon?

spatel abandoned this revision. Sep 3 2019, 5:36 AM

> is this still relevant? abandon?

Abandoning. It's too big of a change even with the predicating TLI hook. We've probably already improved some of the x86 tests with other patches.