This is an archive of the discontinued LLVM Phabricator instance.

AArch64: big endian constant vector pools
Needs Review (Public)

Authored by cpirker on Apr 7 2014, 2:31 AM.

Details

Reviewers: None
Summary

Hi All,

This patch emits constant vectors in big-endian element order.
This matters for compiler-generated vector constants, such as those produced by clang's vector optimizations.
Please review.

Thanks,
Christian


Event Timeline

Hi Amara,

Can you please review?

Thanks,
Christian

Hi Christian,

Sorry for the delay, I was at the conference.

I'm afraid this patch is going in the wrong direction.

LLVM's IR puts the lowest addressed element of a vector in index 0
(regardless of endianness), and I think this applies equally to the
BUILD_VECTORs you're analysing so reversing them will give an
incorrect constant pool.

E.g.

@var = global <4 x i16> <i16 32, i16 33, i16 34, i16 35>
define <4 x i16> @foo() {
  %val = load <4 x i16>* @var
  %diff = sub <4 x i16> %val, <4 x i16> < i16 32, i16 33, i16 34, i16 35 >
  ret <4 x i16> %val
}

With this patch (plus the one to disable LD1) this code will no longer
produce a vector of zeroes. And to stay compatible with the LDR choice,
the answer is that *neither* the global nor the constant pool needs
reversing, rather than that *both* do.

Can you tell us what C-level problem you're trying to solve? I can
hopefully suggest a more consistent change.

Cheers.

Tim.

Hi Tim,

I believe your example should read as follows:

@var =  global <4 x i16> <i16 32, i16 33, i16 34, i16 35>
define <4 x i16> @foo() {
  %val = load <4 x i16>* @var
  %diff = sub <4 x i16> %val, < i16 32, i16 33, i16 34, i16 35 >
  ret <4 x i16> %diff
}

The problem here lies in the ldr load of the variable, which is then used as a vector (that would be a separate issue).
The immediate vector is also initialized via ldr, but from a constant-pool entry whose elements are stored in reversed order.
This leaves a value in q0 that vector instructions can later operate on; that ordering is what this patch produces.

You may check the following C test case, which gets vectorized by clang and thus generates vector instructions:

#include <stdlib.h>
#include <stdio.h>
#define SIZE (26)

__attribute__ ((noinline)) int SumArray(int Array[][SIZE], unsigned int NumI, unsigned int NumJ) {
  unsigned i, j;
  int Result = 0;
  for (i = 0; i < NumI; i++)
    for (j = 0; j < NumJ; j++)
      Result += Array[i][j];
  return Result;
}

__attribute__ ((noinline)) void Init1(int Array[][SIZE] ) {
  unsigned i;
  for (i = 0; i < SIZE; i++)
    Array[i][i] = -i;
}

__attribute__ ((noinline)) void Init2(int Array[][SIZE]) {
  unsigned int i, j;
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      if (j != i)
        Array[i][j] = i+j;
}

int main() {
  int Array[SIZE][SIZE];
  unsigned int i, j;

  Init1( Array );
  Init2( Array );

  printf("Sum(Array[%d,%d]) = %d\n", SIZE, SIZE, SumArray(Array, SIZE, SIZE));
  return 0;
}

Thanks,
Christian

 @var =  global <4 x i16> <i16 32, i16 33, i16 34, i16 35>
 define <4 x i16> @foo() {
   %val = load <4 x i16>* @var
   %diff = sub <4 x i16> %val, < i16 32, i16 33, i16 34, i16 35 >
   ret <4 x i16> %diff
 }

The problem here lies in the ldr load of the variable, which is then used as a vector (that would be a separate issue).

It doesn't: your other patches are canonicalising on using ldr/str
instead of ld1/st1 to define LLVM's register-internal representation
of a vector. Using the ldr is perfectly sound.

Also, you can't reverse the .hword entries in the global because
that's entirely IR-level, and we've decided that the element at index
0 always has the lowest address. It's easy to come up with IR examples
that assume this and break if you try that transformation.
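
For instance (a made-up sketch, just to illustrate), the IR below assumes
that the first .hword emitted for @var is element 0; reversing the global's
elements would make it return 35 instead of 32 on big-endian:

@var = global <4 x i16> <i16 32, i16 33, i16 34, i16 35>

define i16 @first_element() {
  ; index 0 is defined to live at the lowest address, so this scalar load
  ; must see the value 32
  %p = bitcast <4 x i16>* @var to i16*
  %e0 = load i16* %p
  ret i16 %e0
}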

That leaves your suggested constpool change as the problem.

You may check the following C test case, which gets vectorized by clang and thus generates vector instructions:

This is rather a large example to illustrate a point on this scale. I
only see one block in SumArray vectorised by trunk Clang (-target
aarch64_be-linux-gnu -O3). It doesn't seem to contain any BUILD_VECTOR
nodes.

So at the moment, I'm afraid I can't tell you what the real issue with
that is. We're not helped by ld1/st1 still being generated. The sooner
that issue can be resolved the happier I'll be that we're all talking
about the same thing.

Cheers.

Tim.

Hi Christian,

I think Tim is correct. There are a couple of different solutions for big
endian, but we have to choose the one that meets all the requirements and
makes our lives easier.

To favor both compiler optimizations and end users' data exchange, we need
to keep the same in-memory layout for both little endian and big endian,
i.e. index 0 is always at the lowest address in memory. A constant vector in
LLVM IR should ultimately map to the literal pool. Therefore, we shouldn't
reverse the element order of constant vectors in LLVM IR.

For non-strict-alignment mode, we should always use ldr/str, which meets the
requirements for both correctness and performance. Non-strict-alignment mode
is the most common case.

For strict-alignment mode, we need to introduce ld1/st1+rev for unaligned
accesses to avoid hardware alignment exceptions.
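
For example (a minimal made-up sketch, not from any patch here), this is the
kind of under-aligned access that needs the ld1/st1+rev sequence:

define <4 x i16> @load_underaligned(<4 x i16>* %p) {
  ; only 2-byte (element) aligned: with strict alignment checking a plain
  ; ldr/str would trap here, so this is where ld1/st1 + rev come in
  %v = load <4 x i16>* %p, align 2
  ret <4 x i16> %v
}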

Thanks,
-Jiangning

Hi Tim,

Your proposed patch is interesting; however, it does not fully address the issue.
Please consider the following code (derived from the C sample above):

define void @foo( <4 x i64>* %loadaddr, <4 x i32>* %storeaddr ) {
  %1 = load <4 x i64>* %loadaddr
  %2 = add <4 x i64> %1, <i64 1, i64 2, i64 3, i64 4>
  %3 = trunc <4 x i64> %2 to <4 x i32>
  store <4 x i32> %3, <4 x i32>* %storeaddr
  ret void
}

compile with:

llc -march aarch64_be -mattr neon

Note that the constant value is loaded with ldr q and used by a vector instruction, which swaps the two upper and the two lower vector elements. (I would like to note that D3345 does not help either; it causes all vector elements to be swapped, but that's a different issue.) I'm looking forward to your ideas.

Thanks,
Christian

  define void @foo( <4 x i64>* %loadaddr, <4 x i32>* %storeaddr ) {
    %1 = load <4 x i64>* %loadaddr
    %2 = add <4 x i64> %1, <i64 1, i64 2, i64 3, i64 4>
    %3 = trunc <4 x i64> %2 to <4 x i32>
    store <4 x i32> %3, <4 x i32>* %storeaddr
    ret void
  }

Note that the constant value is loaded with ldr q and used by a vector instruction, which swaps the two upper and the two lower vector elements.

I see. I was worried there would be things like this: the remapping
SDNodeXForm I gave takes care of (most) explicit vector indices, but
there are still a whole bunch of implicit ones too. In this case,
concat_vectors, and it has an even simpler problematic example:

define <4 x i32> @foo(<2 x i32> %l, <2 x i32> %r) {
  %res = shufflevector <2 x i32> %l, <2 x i32> %r, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i32> %res
}

All our patterns assume that the LHS of a concat goes in lanes
[0..N-1] and the RHS goes in [N..2N-1]. This is no longer correct in
the LDR-based big-endian world. We probably have to add to the list:
extract_subvector and scalar_to_vector. I can't *think* of any others,
but that doesn't mean they don't exist. Almost certainly any
instruction ending in a "2" is going to be affected.

Now, onto solutions...

  1. It's not necessarily too late to switch to canonicalising on ld1/st1
for the in-register representation. As before, the trade-offs are: bitcast
is odd; function calls & returns are odd; vector indices are normal though.
Probably less efficient code overall, but the implementation is probably
quite a lot simpler.

  2. We could obviously split every pattern involving the problematic
nodes into IsLE and IsBE variants. That's the sledge-hammer solution,
and is very ugly.

  3. We could create a new AArch64ISD::CONCAT_LOHI or something during
ISelLowering that reversed it in the BE cases. This probably has nasty
implications for DAGCombiner so it's not really a good idea.

  4. Swapping the arguments to all concat_vectors during ISelLowering is
probably a very bad idea, but I'll mention it for completeness. The
DAGCombiner will not be expecting that switch to have happened and
will almost certainly get it wrong, e.g. by combining
"(extract_element (concat_vectors L, R), 0)" to "(extract_element L, 0)"
(see the sketch just below this list).

I can't think of any way to make this problem go away with clever
patterns at the moment.

Cheers.

Tim.

I fully agree with your assessment and prefer solution 1 as well. Would you please elaborate on how you imagine implementing it?

Cheers,
Conny

I fully agree with your assessment and prefer solution 1 as well. Would you please elaborate on how you imagine implementing it?

It wasn't necessarily a recommendation at that stage, just one of the
possible options. At least Christian should probably be involved in
the discussion too. And, looking at the final list of BITCAST uses
there may not be much difference in the eventual complexity either
way.

But, implementing it:

  1. Make sure ld1/st1 are always used for big-endian vector loads &
stores rather than ldr/str (stack spills & fills are fine with ldr/str
because they're invisible to user code).

  2. Disable existing pure "bitconvert" patterns on big-endian (around
AArch64InstrNEON.td, line 6556) and write another set, where every
non-trivial bitcast maps to a pair of REV instructions (see the sketch
after this list).

  3. Modify LowerFormalArguments, LowerReturn, LowerCall and
LowerCallResult so that every type enters and exits the boundary as
v16i8/v8i8 and gets immediately BITCAST to the desired type.

  4. Go through every *other* use of bitconvert (in .td) and BITCAST (in
.cpp), deciding whether it's OK or implicitly assuming little-endian
behaviour. At a glance (needs checking):

+ AArch64ISelLowering.cpp:2811 appears completely redundant.
+ LowerVectorSELECT_CC looks plausibly OK, but I'd want to make sure
with testing.
+ getVShiftImm looks dodgy. Actually it looks highly suspect even as
a little-endian optimisation.
+ Uses just after creating NEON_MOVIMM & NEON_MVNIMM nodes look
suspect: they seem to want an actual NOP conversion. I'd probably
switch NEON_MOVIMM etc to take an actual type argument if possible
instead of inferring it from the type of the node (i.e. instead of the
current "(v4i16 (bitcast (v2i32 (MOVIMM ...))))" you'd get
"(v4i16 (MOVIMM ..., v2i32))"). Fixing this would take care of most of
the suspect .td entries too, and generally make the patterns neater.
+ Code just after AArch64ISelLowering.cpp:4723 looks very likely LE specific.
+ Lots of .td uses act on vectors containing all 0 or all 1; these are
clearly fine.
+ Ones involving movi/mvn mentioned above.
+ Not quite sure about the Neon_combine_2D ones, but the bitcasts to
v1i64 look iffy. At the moment I can't even quite see where they're
*created* though.
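
As a made-up sketch of the "non-trivial bitcast" from point 2 (illustrative
only, in IR terms rather than a .td pattern):

define <8 x i16> @reinterpret(<4 x i32> %v) {
  ; bitcast is defined via the in-memory byte image, so with an ld1/st1-based
  ; register layout this is no longer a plain register copy on big-endian;
  ; the pattern has to re-arrange the lanes (REVs) instead
  %r = bitcast <4 x i32> %v to <8 x i16>
  ret <8 x i16> %r
}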

Obviously this would all be a gradual process, rather than one big patch.

Cheers.

Tim.