This is an archive of the discontinued LLVM Phabricator instance.

[ARM/AArch64] Support wide interleaved accesses
ClosedPublic

Authored by mssimpso on Feb 2 2017, 11:00 AM.

Details

Summary

This patch teaches (ARM|AArch64)ISelLowering.cpp to match illegal vector types to interleaved access intrinsics as long as the types are multiples of the vector register width (128 bits). A "wide" access will now be mapped to multiple interleave intrinsics similar to the way in which non-interleaved accesses with illegal types are legalized into multiple accesses. For example, given an interleaved access whose sub-vectors are 256 bits wide, the patch would generate 2 consecutive interleaved memory accesses.

The primary motivation is the vectorization of "mixed-type" loops, such as the one shown below.

f(char *A, int *B, unsigned N) {
  for (unsigned i = 0; i < N; i += 3) {
    B[i + 0] = A[i + 0]
    B[i + 1] = A[i + 1]
    B[i + 2] = A[i + 2]
  }
}

Here, we load char data (i8) and then store it as int data (i32). We'd like to set the loop vectorization factor based on the smaller type, rather than the larger one (we can do this today using the -vectorizer-maximize-bandwidth flag). Let the vectorization factor be 16 in this case for the <16 x i8> data. If we do this, the stored vector type becomes wider than is legal. If we had stride-one accesses this is fine - type legalization will split it up. But for the interleaved accesses we have here, we currently won't be able to map what the vectorizer generates to the proper interleave intrinsics because the type is too wide. Please see the test cases for more concrete examples.

I'll update the associated TTI costs (in getInterleavedMemoryOpCost) as a follow-on patch.

Event Timeline

mssimpso created this revision.Feb 2 2017, 11:00 AM
evandro added a subscriber: evandro.Feb 6 2017, 3:09 PM
rengolin edited edge metadata.Feb 18 2017, 4:30 PM

Hi Matthew,

This change seems pretty straight forward. I only have some silly comments inline and a simple request: to create one additional test for three LD2s (even/odd extract from <24 x i32>) on each arch, just to make sure that it scales.

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7281

I almost missed the modulo check above. To make it easier for posterity, I'd recommend move the 128 check after this comment.

7414

Now that I see repeated, this could be a simple inline function?

7486

this comment change is unnecessary? was that clang-format?

mssimpso updated this revision to Diff 89254.Feb 21 2017, 11:42 AM
mssimpso marked 3 inline comments as done.

Addressed Renato's comments.

  • Moved the "legalize" comment before the modulo check
  • Added a helper function for determining the number of accesses for a wide type
  • Unformatted a comment (clang-format still wrapped the last line)
  • Added a 3xLD2 test case (<24 x i32>) for both ARM and AArch64.
rengolin accepted this revision.Feb 21 2017, 12:41 PM

LGTM, thanks!

This revision is now accepted and ready to land.Feb 21 2017, 12:41 PM
This revision was automatically updated to reflect the committed changes.