Differential D3305
AArch64: big endian constant vector pools
Authored by cpirker on Apr 7 2014, 2:31 AM.

Summary:

Hi All,

This patch gives constant vectors big-endian element order. This is important for compiler-generated vector constants, such as those produced by clang's vectorization passes.

Please review.

Thanks,
Christian
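For context, here is a minimal sketch of what the element-order question looks like at the constant-pool level; the constant and directives are illustrative, not taken from the patch itself:

    ; A vector constant as it might be emitted into a constant pool:
    @cst = private constant <4 x i16> <i16 32, i16 33, i16 34, i16 35>

    ; LLVM's usual rule puts element 0 at the lowest address:
    ;   .hword 32, 33, 34, 35
    ; Reversed element order, as this patch proposes for big-endian:
    ;   .hword 35, 34, 33, 32
    ; presumably so that a single ldr q load lands each element in the
    ; lane that vector instructions expect.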
Hi Christian,

Sorry for the delay, I was at the conference.

I'm afraid this patch is going in the wrong direction. LLVM's IR puts the lowest-addressed element of a vector in index 0. E.g.

    @var = global <4 x i16> <i16 32, i16 33, i16 34, i16 35>

    define <4 x i16> @foo() {
      %val = load <4 x i16>* @var
      %diff = sub <4 x i16> %val, <4 x i16> <i16 32, i16 33, i16 34, i16 35>
      ret <4 x i16> %val
    }

With this patch (+ the one to disable LD1) this code will no longer give the right answer.

Can you tell us what C-level problem you're trying to solve? I can …

Cheers.

Tim.

Hi Tim,

I believe your example should read as follows:

    @var = global <4 x i16> <i16 32, i16 33, i16 34, i16 35>

    define <4 x i16> @foo() {
      %val = load <4 x i16>* @var
      %diff = sub <4 x i16> %val, <i16 32, i16 33, i16 34, i16 35>
      ret <4 x i16> %diff
    }

The problem here lies in the ldr load of the variable, which is then used as a vector (but that would be a separate issue).

You may check the following C test case, which clang vectorizes, thus generating vector instructions:

    #include <stdlib.h>
    #include <stdio.h>

    #define SIZE (26)

    __attribute__ ((noinline))
    int SumArray(int Array[][SIZE], unsigned int NumI, unsigned int NumJ) {
      unsigned i, j;
      int Result = 0;

      for (i = 0; i < NumI; i++)
        for (j = 0; j < NumJ; j++)
          Result += Array[i][j];

      return Result;
    }

    __attribute__ ((noinline))
    void Init1(int Array[][SIZE]) {
      unsigned i;
      for (i = 0; i < SIZE; i++)
        Array[i][i] = -i;
    }

    __attribute__ ((noinline))
    void Init2(int Array[][SIZE]) {
      unsigned int i, j;
      for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
          if (j != i)
            Array[i][j] = i + j;
    }

    int main() {
      int Array[SIZE][SIZE];

      Init1(Array);
      Init2(Array);
      printf("Sum(Array[%d,%d]) = %d\n", SIZE, SIZE,
             SumArray(Array, SIZE, SIZE));
      return 0;
    }

Thanks,
Christian
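For anyone wanting to reproduce this, a command like the following should trigger the loop vectorizer and show the resulting vector code; the target triple is my assumption, as the thread does not give one:

    clang -O3 -S -target aarch64_be-none-linux-gnu test.c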
It isn't a separate issue: your other patches are canonicalising on using ldr/str for vector loads and stores. Also, you can't reverse the .hword entries in the global, because its in-memory layout is observable and LLVM's IR requires index 0 at the lowest address. That leaves your suggested constpool change as the problem.
This is rather a large example to illustrate a point on this scale. So at the moment, I'm afraid I can't tell you what the real issue is.

Cheers.

Tim.

Hi Christian,

I think Tim is correct. There are a couple of different solutions for big-endian here. To favor both compiler optimizations and end users' data exchange, we need to handle the two alignment modes differently.

For non-strict-alignment mode, we should always use ldr/str, which can handle unaligned accesses directly.

For strict-alignment mode we need to introduce ld1/st1+rev for unaligned accesses, since unaligned ldr/str would fault.

Thanks,

Hi Tim,

Your proposed patch is interesting; however, it does not fully address the issue. Consider:

    define void @foo(<4 x i64>* %loadaddr, <4 x i32>* %storeaddr) {
      %1 = load <4 x i64>* %loadaddr
      %2 = add <4 x i64> %1, <i64 1, i64 2, i64 3, i64 4>
      %3 = trunc <4 x i64> %2 to <4 x i32>
      store <4 x i32> %3, <4 x i32>* %storeaddr
      ret void
    }

Compile with:

    llc -march aarch64_be -mattr neon

Note that the constant value is loaded with ldr q and then used by a vector instruction, thus swapping the two upper and the two lower vector elements. (I would like to note that D3345 does not help either, since it causes all vector elements to be swapped, but that's a different issue.)

I'm looking forward to your ideas.

Thanks,
Christian
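As background for the lane swap Christian describes: on big-endian AArch64, ldr/str treat the vector register as a single 128-bit quantity, while ld1/st1 load and store element by element. A small worked illustration (mine, not from the thread):

    ; Memory holding <4 x i32> <i32 0, i32 1, i32 2, i32 3>, element 0 at the
    ; lowest address; on a big-endian target the bytes are:
    ;   00 00 00 00 | 00 00 00 01 | 00 00 00 02 | 00 00 00 03
    ;
    ;   ld1 { v0.4s }, [x0]   ; lane 0 = 0, lane 1 = 1, lane 2 = 2, lane 3 = 3
    ;   ldr q0, [x0]          ; lane 0 = 3, lane 1 = 2, lane 2 = 1, lane 3 = 0
    ;
    ; A <4 x i64> value occupies two q registers, so loading it with ldr q
    ; swaps the elements within each pair, which is exactly the effect
    ; Christian observes above.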
I see. I was worried there would be things like this: the remapping affects anything that refers to lanes by number, shuffles included. For example:

    define <4 x i32> @foo(<2 x i32> %l, <2 x i32> %r) {
      %res = shufflevector <2 x i32> %l, <2 x i32> %r, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
      ret <4 x i32> %res
    }

All our patterns assume that the LHS of a concat goes in lanes 0 and 1.
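To spell out why those patterns break if lane numbering flips, here is a hand-worked illustration (values mine, not from the thread):

    ; With %l = <i32 10, i32 11> and %r = <i32 20, i32 21>, the IR semantics
    ; of the shuffle above require:
    ;   %res = <i32 10, i32 11, i32 20, i32 21>   ; %l in lanes 0-1, %r in 2-3
    ; The existing patterns place %l in the low half of the register, which
    ; is correct for little-endian. If big-endian lowering reverses the lane
    ; numbering, the same placement would instead yield:
    ;   %res = <i32 20, i32 21, i32 10, i32 11>   ; halves swapped: wrong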
Now, onto solutions...

1. Standardise on ld1/st1 for the in-register representation. As before, the trade-offs are the alignment and data-exchange questions discussed above.

2. Split the affected ISD nodes into IsLE and IsBE variants. That's the sledge-hammer solution.

3. Add a remapping of lane indices in ISelLowering that reverses them in the BE cases. This probably has nasty corner cases.

4. There is one more option, probably a very bad idea, but I'll mention it for completeness: …

I can't think of any way to make this problem go away with cleverness alone.

Cheers.

Tim.

I fully agree with your assessment and prefer solution 1 as well. Would you please elaborate on how you imagine implementing it?

Cheers,
Christian
It wasn't necessarily a recommendation at that stage, just one of the possibilities.

But, implementing it:
1. Use ld1/st1 for vector loads and stores rather than ldr/str (stack spills & fills are fine with ldr/str, since the same function both stores and reloads the value consistently).

2. Disable the existing lane-referencing patterns (around AArch64InstrNEON.td, line 6556) and write another set for big-endian, where every explicit lane index is remapped.

3. Update LowerFormalArguments and LowerCallResult so that every type enters and exits the boundary in a consistent in-register layout, matching what ld1/st1 produce.

4. Audit the remaining uses of vector constants and lane numbers (in the .td files and AArch64ISelLowering.cpp), deciding whether each is OK or implicitly assuming little-endian. From a first look:

+ AArch64ISelLowering.cpp:2811 appears completely redundant.
+ LowerVectorSELECT_CC looks plausibly OK, but I'd want to make sure with testing.
+ getVShiftImm looks dodgy. Actually it looks highly suspect even as a little-endian optimisation.
+ Uses just after creating NEON_MOVIMM & NEON_MVNIMM nodes look suspect: they seem to want an actual no-op conversion. I'd probably …
+ Code just after AArch64ISelLowering.cpp:4723 looks very likely LE-specific.
+ Lots of .td uses act on vectors containing all 0s or all 1s; these are clearly fine.
+ The ones involving movi/mvn were mentioned above.
+ Not quite sure about the Neon_combine_2D ones, but the bitcasts to v1i64 look iffy. At the moment I can't even quite see where they're …

Obviously this would all be a gradual process, rather than one big patch.

Cheers.

Tim.
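To close, a small example (mine, not from the thread) of the distinction Tim's audit list draws between lane-agnostic and lane-dependent operations:

    define i32 @lane_example(<4 x i32> %x) {
      ; Lane-agnostic: an all-ones (or all-zeros) operand behaves identically
      ; under any lane numbering, so patterns like this are safe on big-endian.
      %masked = and <4 x i32> %x, <i32 -1, i32 -1, i32 -1, i32 -1>
      ; Lane-dependent: element 0 is defined relative to memory order, so any
      ; pattern matching an explicit index needs auditing for big-endian.
      %e = extractelement <4 x i32> %masked, i32 0
      ret i32 %e
    }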