This is a follow-up to the change in D33578 that introduced this transform:
(extract_subvector (load wide vector)) --> (load narrow vector)
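(For context, here is a rough sketch of what that transform looks like as a DAG combine. This is illustrative only - the helper name, the variable names, and the exact SelectionDAG calls are placeholders, not the committed D33578 code - and it deliberately omits the profitability checks discussed below.)

    // Illustrative sketch, not the committed code:
    // (extract_subvector (load wide), idx) --> (load narrow from ptr + offset)
    static SDValue narrowExtractedVectorLoad(SDNode *Extract, SelectionDAG &DAG) {
      SDValue Src = Extract->getOperand(0);
      if (Src.getOpcode() != ISD::LOAD || !Src.hasOneUse())
        return SDValue();
      auto *Ld = cast<LoadSDNode>(Src);
      if (Ld->isVolatile() || Ld->getExtensionType() != ISD::NON_EXTLOAD)
        return SDValue();

      EVT NarrowVT = Extract->getValueType(0);
      unsigned Index = Extract->getConstantOperandVal(1);
      unsigned Offset = Index * NarrowVT.getScalarSizeInBits() / 8;

      // Point at the extracted piece and load just that part of the memory.
      SDLoc DL(Extract);
      SDValue Ptr = Ld->getBasePtr();
      SDValue NewPtr =
          DAG.getNode(ISD::ADD, DL, Ptr.getValueType(), Ptr,
                      DAG.getConstant(Offset, DL, Ptr.getValueType()));
      SDValue NewLd =
          DAG.getLoad(NarrowVT, DL, Ld->getChain(), NewPtr,
                      Ld->getPointerInfo().getWithOffset(Offset));
      // A real implementation must also keep the new load's memory ordering
      // equivalent to the old wide load's; chain fixup is omitted here.
      return NewLd;
    }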
lib/CodeGen/SelectionDAG/DAGCombiner.cpp:14674
This condition looks like it's going to trigger differently for different indexes when combining consecutive subvectors extracted from the same larger vector. This is why the vec_int_to_fp case still has a load to zmm0 (see other comment). It seems like what we'd like to do is check that all uses of Ld are cheap if there is more than one use (and then convert all uses simultaneously). That said, I think checking freeness/cheapness for each possible ExtIdxValue is the way to go.
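(For illustration, a sketch of that "all uses are cheap" check. This is not code from the patch; Ld, TLI, and the iteration details are placeholders.)

    // Sketch: only narrow when every value use of the wide load is an
    // extract_subvector that the target considers cheap, so all of them get
    // converted together instead of leaving a stray wide load behind.
    bool AllUsesAreCheapExtracts = true;
    for (SDNode::use_iterator UI = Ld->use_begin(), UE = Ld->use_end();
         UI != UE; ++UI) {
      if (UI.getUse().getResNo() != 0)
        continue; // ignore uses of the chain result
      SDNode *User = *UI;
      if (User->getOpcode() != ISD::EXTRACT_SUBVECTOR ||
          !TLI.isExtractSubvectorCheap(User->getValueType(0),
                                       User->getConstantOperandVal(1))) {
        AllUsesAreCheapExtracts = false;
        break;
      }
    }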
test/CodeGen/X86/vec_int_to_fp.ll:3669
We're only partially converting the load-extracts here. There should be only a load to zmmX plus extracts, or 4 direct loads to xmmX.
test/CodeGen/X86/vec_int_to_fp.ll:3669
Agreed - that's what I meant in the description when I said that these diffs might be seen as bugs in isExtractSubvectorCheap(). In this case, x86 has made it cheap to extract from index 0 or one other index:

    return (Index == 0 || Index == ResVT.getVectorNumElements());

Clearly, this was only tested with cases where we are extracting a half-sized vector, so it misses 2 of the N/4 possible extract indexes for AVX512 in this test. I think this change is still an improvement (though not ideal, of course), but my goal with this patch was really to answer the questions for the non-x86 diffs. I could just skip this step and post the more liberal patch with more test diffs if that seems better.
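(For illustration only - this is a hypothetical generalization, not something proposed in this patch - the x86 hook could accept any extract that starts on a result-width boundary, which would cover all of the possible positions in the AVX512 case.)

    // Hypothetical generalization of the x86 isExtractSubvectorCheap() hook:
    // any extract whose index is a multiple of the result's element count
    // maps to a single subregister extract / VEXTRACT instruction.
    bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT,
                                                    unsigned Index) const {
      if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
        return false;
      return (Index % ResVT.getVectorNumElements()) == 0;
    }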
It looks like most of the AMDGPU cases fail because:
- TLI.isExtractSubvectorCheap(VT, ExtIdxValue) is not defined for AMDGPU.
- Legalization breaks sign-/zero-extended vectors into a concat of smaller subvectors.
The former seems easy for someone who knows AMDGPU to correct.
Actually, I see another way out. I missed this TLI hook:
    // Return true if it is profitable to reduce the given load node to a
    // smaller type.
    //
    // e.g. (i16 (trunc (i32 (load x))) -> i16 load x should be performed
    virtual bool shouldReduceLoadWidth(SDNode *Load, ISD::LoadExtType ExtTy,
                                       EVT NewVT) const {
      return true;
    }
This was originally added for AMDGPU (rL224084), so that should prevent the regressions.
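(Roughly, the guard in the combine would look like the following; the variable names are illustrative.)

    // Bail out of the narrowing combine when the target says a smaller load
    // is not worth it - this is what protects the AMDGPU patterns.
    if (!TLI.shouldReduceLoadWidth(Ld, Ld->getExtensionType(), NarrowVT))
      return SDValue();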
Patch updated:
- Remove the one-use restriction.
- Add the TLI.shouldReduceLoadWidth() predicate.
So now we see the full effect on x86 and sidestep the AMDGPU problems, but we seem to have introduced some ARM regressions.
AFAICT, the x86 diffs are all wins. This includes an improvement to select non-temporal loads where we failed to do so before.
Patch updated:
Rebased after rL304718 - the AVX1 non-temporal isel got fixed there, so now we just see different scheduling in those tests.
The diffs to the ARM tests are clearly no good: you're splitting 128-bit vector loads into two 64-bit vector loads for no benefit.
You're generating fewer instructions on x86, but it's not obvious it's beneficial; you get rid of the EXTRACT_SUBVECTOR operations, but the end result is a lot more instructions with memory operands.
test/CodeGen/AArch64/arm64-vabs.ll:141
We need to generate more complete checks for these tests... but I would guess this is adding extra instructions.
Abandoning. It's too big of a change even with the predicating TLI hook. We've probably already improved some of the x86 tests with other patches.