This is an archive of the discontinued LLVM Phabricator instance.

merge vector stores into wider vector stores and fix AArch64 misaligned access TLI hook (PR21711)
ClosedPublic

Authored by spatel on Sep 4 2015, 8:36 AM.

Details

Summary

This is a redo of D7208 ( r227242 - http://llvm.org/viewvc/llvm-project?view=revision&revision=227242 ).

The patch was reverted because an AArch64 target could infinite loop after the change in DAGCombiner to merge vector stores. That happened because AArch64's allowsMisalignedMemoryAccesses() wasn't telling the truth. It reported all unaligned memory accesses as fast, but then split some 128-bit unaligned accesses up in performSTORECombine() because they are slow.

This patch attempts to fix the problem in allowsMisalignedMemoryAccesses() while preserving existing lowering behavior for AArch64.
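
To make the contract concrete, here is a small standalone C++ model of the decision the fixed hook has to make. This is an editorial sketch, not the code in AArch64ISelLowering.cpp: unaligned accesses remain allowed, but 128-bit accesses that performSTORECombine() would later split must not be reported as fast. The 16-byte size check and the "alignment of 1 or 2 means the user opted in" rule are simplifications of the heuristics discussed in the inline comments below.

// Editorial sketch only: models the allow/fast split described above.
// It is not the actual TargetLowering hook or its signature.
#include <cstdio>

struct MisalignedAccessModel {
  bool StrictAlign = false; // models a strict-alignment mode where nothing unaligned is allowed

  // Returns whether an unaligned access of StoreSizeBytes with the given
  // alignment is allowed at all; *Fast reports whether it is cheap. A
  // 128-bit access that the store combine would later split must not be
  // reported as fast, or DAGCombiner can loop forever re-creating the
  // wide store it just took apart.
  bool allowsMisaligned(unsigned StoreSizeBytes, unsigned AlignBytes, bool *Fast) const {
    if (StrictAlign)
      return false;
    if (Fast) {
      // Assumption for illustration: alignment of 1 or 2 is treated as a
      // deliberate request for an unaligned access (see the clang
      // vector-extension discussion below); otherwise an underaligned
      // 16-byte store is the case the AArch64 store combine splits.
      bool SlowSplitCase =
          StoreSizeBytes == 16 && AlignBytes > 2 && AlignBytes < 16;
      *Fast = !SlowSplitCase;
    }
    return true;
  }
};

int main() {
  MisalignedAccessModel M;
  bool Fast = false;
  bool Allowed = M.allowsMisaligned(/*StoreSizeBytes=*/16, /*AlignBytes=*/4, &Fast);
  std::printf("allowed=%d fast=%d\n", Allowed, Fast);
}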

Diff Detail

Event Timeline

spatel updated this revision to Diff 34033. Sep 4 2015, 8:36 AM
spatel retitled this revision to merge vector stores into wider vector stores and fix AArch64 misaligned access TLI hook (PR21711).
spatel updated this object.
spatel added a subscriber: llvm-commits.
arsenm added a subscriber: arsenm. Sep 11 2015, 12:37 PM
arsenm added inline comments.
lib/Target/AArch64/AArch64ISelLowering.cpp
804

Pedantry: using VT.getStoreSize() would be preferable
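
As an editorial aside on the difference (a sketch, not code from the patch): getSizeInBits() is the raw bit width of the type, while the store size is that width rounded up to whole bytes, which is what matters when sizing a merged memory access.

// Sketch of the bit-width vs. store-size relationship; the example widths
// in main() are illustrative, not queried from LLVM.
#include <cstdio>

unsigned storeSizeInBytes(unsigned SizeInBits) {
  return (SizeInBits + 7) / 8; // round up to a whole number of bytes
}

int main() {
  std::printf("i1:    %3u bits -> %2u byte(s)\n", 1u, storeSizeInBytes(1));
  std::printf("v3i32: %3u bits -> %2u byte(s)\n", 96u, storeSizeInBytes(96));
  std::printf("v4i32: %3u bits -> %2u byte(s)\n", 128u, storeSizeInBytes(128));
}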

test/CodeGen/AArch64/merge-store.ll
30

Can you add some more test cases with different combinations of vector sizes? I would like to see a test case that tries to combine multiple 3-element vectors.

spatel added inline comments. Sep 11 2015, 2:30 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
804

Ok.

Note: I was misled at first by the 'store' in the name, since we're handling both loads and stores here, but I see now that the function is not really specific to stores.

test/CodeGen/AArch64/merge-store.ll
30

I've been trying to find a way to do this, but all attempts so far have been thwarted because a type like v3f32 is not a simple type, so MergeConsecutiveStores doesn't get very far. Do you have a specific pattern that you're thinking of? Keep in mind that this patch limits vector merging to stores of elements extracted from a vector, so we're not even handling loads yet.
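
As a source-level illustration of the narrow case the patch does handle (a hypothetical example written with GCC/clang vector extensions, not one of the tests in merge-store.ll): consecutive stores of pieces extracted from a wider vector, which the combine can turn back into a single wide store once the target reports the wide unaligned store as fast.

// Hypothetical example; the type and function names are made up for
// illustration.
typedef float v4f32 __attribute__((vector_size(16)));
typedef float v2f32 __attribute__((vector_size(8)));

void store_halves(v4f32 V, v2f32 *Dst) {
  v2f32 Lo = {V[0], V[1]}; // elements extracted from the wide vector
  v2f32 Hi = {V[2], V[3]};
  Dst[0] = Lo; // two consecutive narrow stores of extracted elements,
  Dst[1] = Hi; // the pattern that may be widened into one 16B store
}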

arsenm added inline comments. Sep 11 2015, 2:35 PM
test/CodeGen/AArch64/merge-store.ll
30

I was thinking of a case where you had something like load v3i32, load v3i32, and those might be legalized into v2i32, i32, v2i32, i32, so you would have mixed scalar and vector loads to worry about.

I forgot that MergeConsecutiveStores currently only runs before types are legalized, so this won't really do anything for now, although I would eventually like to run it again later and to handle 3-element vectors.

spatel added inline comments. Sep 11 2015, 2:36 PM
test/CodeGen/AArch64/merge-store.ll
30

Also note that we don't handle merging different sized types in getStoreMergeAndAliasCandidates(). I filed PR24654 ( https://llvm.org/bugs/show_bug.cgi?id=24654 ) related to that.

arsenm added inline comments. Sep 11 2015, 2:40 PM
test/CodeGen/AArch64/merge-store.ll
30

D12698 may help with that specific problem.

spatel updated this revision to Diff 34591. Sep 11 2015, 2:46 PM

Patch updated based on Matt's feedback:

  1. Use getStoreSize() rather than getSizeInBits().
  2. I haven't found any interesting new test cases where this fires, because we're still quite limited in what we try to merge, so there are no new test cases vs. the previous revision of this patch.
arsenm added inline comments. Sep 20 2015, 11:50 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

Which extensions do you mean? I've been looking for a way to specify alignment of vector loads from C, but nothing I've tried seems to work.

However, using the existence of this to justify reporting a different alignment as fast seems suspect.

spatel added inline comments. Sep 20 2015, 12:22 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

I agree that this looks hacky (along with the comment about optimizing for benchmarks), but the comments and code are copied directly from the existing performSTORECombine() (see around line 8476).

I don't want to alter the existing AArch64 logic in this patch (other than fixing the obviously broken allowsMisalignedMemoryAccesses() implementation to allow the vector merging in DAGCombiner).

scanon added a subscriber: scanon. Sep 25 2015, 10:45 AM
ab accepted this revision. Sep 25 2015, 11:44 AM
ab added a reviewer: ab.
ab added a subscriber: ab.

LGTM

lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

I guess this refers to something like:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef v4i32 __attribute__((aligned(2))) v4i32_a2;

v4i32 foo(v4i32 *p) {
  v4i32_a2 *p2 = p;
  return *p2;
}

Interestingly, this generates a naturally aligned load:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef v4i32 __attribute__((aligned(2))) v4i32_a2;

v4i32 foo(v4i32 *p) {
  return *(v4i32_a2 *)p;
}
This revision is now accepted and ready to land. Sep 25 2015, 11:44 AM
arsenm added inline comments. Sep 25 2015, 11:51 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

Looks like a bug to me

scanon added inline comments. Sep 25 2015, 1:01 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

That's expected, because p has 16B alignment; that alignment doesn't go away just because you cast the pointer to a less-aligned type.

To get an unaligned load, you do something like:

typedef int __attribute__((ext_vector_type(4))) v4i32;
typedef int __attribute__((ext_vector_type(4), aligned(4))) v4i32_a4;
v4i32 foo(const int *p) {
  return *(const v4i32_a4 *)p;
}
scanon added inline comments. Sep 25 2015, 1:13 PM
test/CodeGen/X86/MergeConsecutiveStores.ll
482

Combining these stores is not an unambiguous win.

With 16B alignment (as here), you're introducing a cacheline-crossing penalty where there otherwise would not be one; even with unknown alignment, Intel's optimization manual recommends using two 16B stores, rather than a 32B store (11.6.2). Does anyone from Intel want to comment on the considerations here?
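
To put a rough number on that (an editorial back-of-the-envelope sketch, assuming 64-byte cache lines and only the 16B alignment guaranteed here): a 32B store crosses a line boundary for exactly one of the four possible 16B-aligned offsets within a line, so roughly a quarter of such stores would pay the crossing penalty if offsets were uniformly distributed.

// Editorial sketch: which 16B-aligned offsets make a 32B store straddle a
// 64-byte cache line (the line size is an assumption here).
#include <cstdint>
#include <cstdio>

constexpr uint64_t LineSize = 64;

bool crossesCacheLine(uint64_t Addr, uint64_t SizeBytes) {
  return Addr / LineSize != (Addr + SizeBytes - 1) / LineSize;
}

int main() {
  for (uint64_t Off = 0; Off < LineSize; Off += 16)
    std::printf("offset %2llu: 32B store %s\n", (unsigned long long)Off,
                crossesCacheLine(Off, 32) ? "crosses a line" : "stays within one line");
}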

Please see the discussion in D12154.

I don't think we take cacheline-crossing penalties into account anywhere in the compiler. That is, we produce unaligned accesses for all x86 targets when we can merge smaller ops together to reduce the instruction count.

Note that we do have CPU attributes (e.g., FeatureSlowUAMem32) that change this behavior; see unaligned-32-byte-memops.ll for examples of how that works.

This revision was automatically updated to reflect the committed changes.
ab added inline comments. Sep 25 2015, 4:54 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
808–811

> That's expected, because p has 16B alignment; that alignment doesn't go away just because you cast the pointer to a less-aligned type.

You're right, but can't we argue the opposite? If I want to make the alignment go away, isn't __attribute__((aligned)) the best (or at least a very obvious) tool at my disposal?

In any case, I guess the most important thing is to match gcc's behavior, and from what I see this changed since ~r246705, possibly as unintentional fallout of the recent alignment-tracking improvements: I filed PR24944 to look into this.